Tuesday 26 December 2017

python - Large, persistent DataFrame in pandas


I am a long-time SAS user exploring a switch to Python and pandas.

However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB CSV file. It had about 200,000 rows and 200 columns of mostly numeric data.

With SAS, I can import a CSV file into a SAS dataset and it can be as large as my hard drive.

Is there something analogous in pandas?

I regularly work with large files and do not have access to a distributed computing network.



Answer




In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague, but it has been known for a long time: http://github.com/pydata/pandas/issues/407).



At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row by row into a pre-allocated NumPy array or memory-mapped file, np.memmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
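As a rough illustration of the first, tedious workaround, here is a minimal sketch that streams a mostly numeric CSV into a pre-allocated np.memmap; the file names, shape, and dtype are assumptions for the example, not details from the question.

    import csv
    import numpy as np

    # Hypothetical sketch: stream the CSV row by row into a pre-allocated
    # memory-mapped array, so the whole text file is never held in memory.
    # "data.csv", "data.dat", and the shape/dtype are placeholder assumptions.
    n_rows, n_cols = 200000, 200
    mm = np.memmap("data.dat", dtype="float64", mode="w+", shape=(n_rows, n_cols))

    with open("data.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row (assuming the file has one)
        for i, row in enumerate(reader):
            mm[i] = [float(x) for x in row]

    mm.flush()  # push the mapped pages out to disk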
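And a minimal sketch of the chunked approach; the file name and chunk size are placeholders, and pd.concat simply stitches the chunks back together at the end.

    import pandas as pd

    # iterator=True / chunksize makes read_csv return a TextFileReader that
    # yields smaller DataFrames, so only one chunk is parsed at a time.
    reader = pd.read_csv("data.csv", iterator=True, chunksize=1000)
    df = pd.concat(reader, ignore_index=True)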

