I am exploring switching to Python and pandas as a long-time SAS user.

However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB CSV file. It had about 200,000 rows and 200 columns of mostly numeric data.

With SAS, I can import a CSV file into a SAS dataset, and it can be as large as my hard drive. Is there something analogous in pandas? I regularly work with large files and do not have access to a distributed computing network.

In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague, but it's been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or a memory-mapped file via np.memmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
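
A minimal sketch of both workarounds, not a definitive implementation. The function names read_csv_in_chunks and csv_to_memmap are hypothetical, and the memory-mapped version assumes an all-numeric file with a single header row whose row and column counts are known in advance.

```python
import csv

import numpy as np
import pandas as pd


def read_csv_in_chunks(path, chunksize=1000):
    """Read a CSV piece by piece and combine the pieces into one DataFrame."""
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of parsing the whole file in one pass.
    chunks = pd.read_csv(path, chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)


def csv_to_memmap(path, out_path, n_rows, n_cols, dtype="float64"):
    """Copy an all-numeric CSV (one header row) into a disk-backed array."""
    # Pre-allocate the full array as a file on disk (assumed shape is known).
    arr = np.memmap(out_path, dtype=dtype, mode="w+", shape=(n_rows, n_cols))
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for i, row in enumerate(reader):
            arr[i, :] = [float(x) for x in row]
    arr.flush()  # make sure the data is written out to out_path
    return arr
```

Note that the chunked version still ends with the whole DataFrame in memory, so it only helps when the parsed result fits in RAM. The memory-mapped version keeps the data on disk, which is closer to how a SAS dataset behaves.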