Time performance in generating a very large text file in Python
I need to generate a very large text file. Each line has a simple format: a sequence number and a value separated by a space,

    Seq_num num_val

for example:

    12343234 759
Let's assume I am going to generate a file with 100 million lines. I tried 2 approaches and surprisingly they give very different time performance.

First, a for loop over 100m iterations. In each iteration I make a short string of Seq_num num_val and write it to the file. This approach takes a lot of time.
    ## APPROACH 1
    for seq_id in seq_ids:
        num_val = rand()
        line = seq_id + ' ' + num_val
        data_file.write(line)
Second, a for loop over 100m iterations. In each iteration I make the same short string of Seq_num num_val, but append it to a list. When the loop finishes, I iterate over the list items and write each item to the file. This approach takes far less time.
    ## APPROACH 2
    data_lines = list()
    for seq_id in seq_ids:
        num_val = rand()
        l = seq_id + ' ' + num_val
        data_lines.append(l)
    for line in data_lines:
        data_file.write(line)
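For reference, a minimal runnable sketch of the two snippets above as I time them (they are simplified pseudocode; this version adds the str() conversions, the newline, and a timer, and uses 1 million lines so it finishes quickly):

    import random
    import time

    N = 1000000  # smaller than 100 million, just for a quick comparison

    # Approach 1: write each line inside the loop
    start = time.time()
    with open('file1.txt', 'w') as data_file:
        for seq_id in range(N):
            num_val = random.random()
            data_file.write(str(seq_id) + ' ' + str(num_val) + '\n')
    print('approach 1: %.3f s' % (time.time() - start))

    # Approach 2: build a list first, then write it out in a second loop
    start = time.time()
    data_lines = []
    for seq_id in range(N):
        num_val = random.random()
        data_lines.append(str(seq_id) + ' ' + str(num_val) + '\n')
    with open('file2.txt', 'w') as data_file:
        for line in data_lines:
            data_file.write(line)
    print('approach 2: %.3f s' % (time.time() - start))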
Note that:

- Approach 2 has 2 loops instead of 1 loop.
- I write to the file in a loop in both approach 1 and approach 2, so this step should be the same for both.

So approach 1 should take less time. Any hints on what I am missing?
Considering APPROACH 2, I think I can assume you have the data for all the lines (or at least in big chunks) before you need to write it to the file.

The other answers are great and it was really informative to read them, but both focused on optimizing the file writing or avoiding the first for loop by replacing it with a list comprehension (which is known to be faster).

They missed the fact that you are iterating in a for loop to write the file, which is not really necessary.
Instead of doing that, at the cost of some extra memory (which in this case is affordable, since a 100-million-line file would be about 600 MB), you can create just one string in a more efficient way by using the formatting or join features of Python's str, and then write the big string to the file, relying on a list comprehension or generator expression to get the data to be formatted.
With loop1 and loop2 of @Tombart's answer, I get elapsed time 0:00:01.028567 and elapsed time 0:00:01.017042, respectively. While with this code:
    from datetime import datetime
    import random

    start = datetime.now()
    data_file = open('file.txt', 'w')

    # generator expression: each line is produced on demand and consumed by join
    data_lines = ('%i %f\n' % (seq_id, random.random())
                  for seq_id in xrange(0, 1000000))
    contents = ''.join(data_lines)
    data_file.write(contents)

    end = datetime.now()
    print("elapsed time %s" % (end - start))
I get elapsed time 0:00:00.722788, which is about 25% faster.
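As a side note, the snippet above is Python 2 code (xrange). A roughly equivalent Python 3 sketch of the same idea, assuming the same one-million-line test, would be:

    from datetime import datetime
    import random

    start = datetime.now()
    with open('file.txt', 'w') as data_file:
        # same idea: build one big string from a generator expression, write it once
        contents = ''.join('%i %f\n' % (seq_id, random.random())
                           for seq_id in range(1000000))
        data_file.write(contents)
    end = datetime.now()
    print("elapsed time %s" % (end - start))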
Notice that data_lines is a generator expression, so the list is not really stored in memory, and the lines are generated and consumed on demand by the join method. This implies that the only variable significantly occupying memory is contents. This also slightly reduces the running time.
If the text is too large to do all the work in memory, you can always split the work into chunks: that is, format the string and write it to the file every million lines or so, as sketched below.
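A minimal sketch of that chunked variant (the chunk size of one million lines and the 100-million total are just the numbers from the question; Python 3 names assumed):

    from datetime import datetime
    import random

    TOTAL_LINES = 100000000  # 100 million lines, as in the question
    CHUNK_SIZE = 1000000     # format and write one million lines at a time

    start = datetime.now()
    with open('file.txt', 'w') as data_file:
        for chunk_start in range(0, TOTAL_LINES, CHUNK_SIZE):
            # build one big string per chunk, then write it with a single call
            chunk = ''.join('%i %f\n' % (seq_id, random.random())
                            for seq_id in range(chunk_start, chunk_start + CHUNK_SIZE))
            data_file.write(chunk)
    end = datetime.now()
    print("elapsed time %s" % (end - start))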
Conclusions:

- Always try to use a list comprehension instead of a plain for loop (a list comprehension is even faster than filter for filtering lists; see https://stackoverflow.com/questions/3013449/list-filtering-list-comprehension-vs-lambda-filter).
- If memory or implementation constraints allow it, try to create and encode the string contents all at once, using the format or join functions.
- If possible, and as long as the code remains readable, use built-in functions to avoid explicit for loops. For example, use the extend function of a list instead of iterating and calling append (see the small sketch after this list). In fact, both previous points can be seen as examples of this remark.
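For illustration, a small sketch of that last point (the variable names are made up for the example):

    import random

    values = [random.random() for _ in range(1000)]

    # explicit loop with append
    doubled = []
    for v in values:
        doubled.append(v * 2)

    # same result with extend and a generator expression: fewer Python-level
    # statements per item, and usually a bit faster and more readable
    doubled2 = []
    doubled2.extend(v * 2 for v in values)

    # or simply a list comprehension, which also covers the first conclusion
    doubled3 = [v * 2 for v in values]

    assert doubled == doubled2 == doubled3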
Remark. Although this answer can be considered useful on its own, it does not completely answer the original question of why the two-loops option seems to run faster in some environments. For that, @Aiken Drum's answer below can perhaps shed some light on the matter.