Thursday, 26 October 2017

algorithm - Time performance in Generating very large text file in Python


I need to generate a very large text
file. Each line has a simple
format:



Seq_num num_val
12343234 759


Let's assume I am
going to generate a file with 100 million lines.
I tried two approaches and,
surprisingly, they give very different running times.





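  1. A for loop over the 100 million sequence numbers. In each iteration I
    build the short string "seq_num num_val" and write it to the file
    straight away.
    This approach takes a lot of time.

    ## APPROACH 1
    # seq_ids, rand() and data_file are assumed to be defined elsewhere
    for seq_id in seq_ids:
        num_val = rand()
        line = seq_id + ' ' + num_val
        data_file.write(line)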


  2. A for loop over the 100 million sequence numbers. In each iteration I
    build the short string "seq_num num_val" and append it to a list.
    When the loop finishes, I iterate over the list items and write each
    item to the file.
    This approach takes far less time.

    ## APPROACH 2
    # seq_ids, rand() and data_file are assumed to be defined elsewhere
    data_lines = list()
    for seq_id in seq_ids:
        num_val = rand()
        l = seq_id + ' ' + num_val
        data_lines.append(l)
    for line in data_lines:
        data_file.write(line)



Note
that:




  • Approach 2 has 2
    loops instead of 1.


  • I write to the file inside a
    loop in both approach 1 and approach 2, so this step should cost the
    same for both.



So approach 1 should
take less time. Any hints as to what I am missing?


Answer




Considering APPROACH 2, I think I can assume
you have the data for all the lines (or at least in big chunks)
before you need to write it to the
file.



The other answers are great and reading them was
really instructive, but both focused either on optimizing the file writing or
on replacing the first for loop with a list comprehension (which is known to be
faster).



They missed the fact that you are
iterating in a for loop to write the file, which is not really
necessary.




Instead, at the cost of
more memory (affordable in this case, since a 100 million line file
would be about 600 MB), you can build a single string more efficiently by using
the formatting or join features of Python's str, and then write that one big string to the
file, relying on a list comprehension to produce the data to be
formatted.



With loop1 and loop2 of @Tombart's
answer, I get elapsed time 0:00:01.028567 and
elapsed time 0:00:01.017042,
respectively.



While with this
code:



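    import random
    from datetime import datetime

    start = datetime.now()

    with open('file.txt', 'w') as data_file:
        # Generator expression: each line is formatted on demand for join()
        data_lines = ('%i %f\n' % (seq_id, random.random())
                      for seq_id in range(0, 1000000))  # xrange on Python 2
        contents = ''.join(data_lines)
        # The whole file is written with a single call
        data_file.write(contents)

    end = datetime.now()
    print("elapsed time %s" % (end - start))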


I get
elapsed time 0:00:00.722788, which is roughly 30%
less time.




Notice that
data_lines is a generator expression, so no list of lines is kept
in your own code: the lines are generated on demand and consumed by the
join method. (CPython's join does collect the items internally before
concatenating, but that temporary is released once the call returns.) The only
variable that stays significantly large in memory is contents. In my tests this
also reduced the running time slightly.



If the text is too
large to do all the work in memory, you can always split it into chunks, that is,
format the string and write it to the file every million lines or
so.
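
The chunked variant just described could look roughly like the sketch below. The function name write_in_chunks, the chunk_size parameter and the output path are hypothetical and chosen only for illustration; the line format is the same '%i %f\n' used earlier in this answer.

```python
import random


def write_in_chunks(path, total_lines, chunk_size=1_000_000):
    """Write total_lines lines of 'seq_id value', one joined string per chunk.

    Hypothetical helper: the name and parameters are illustrative only.
    Returns the number of lines written.
    """
    written = 0
    with open(path, 'w') as data_file:
        for chunk_start in range(0, total_lines, chunk_size):
            chunk_end = min(chunk_start + chunk_size, total_lines)
            # Format and join only the lines that belong to this chunk
            contents = ''.join(
                '%i %f\n' % (seq_id, random.random())
                for seq_id in range(chunk_start, chunk_end)
            )
            data_file.write(contents)
            written += chunk_end - chunk_start
    return written
```

Called as write_in_chunks('file.txt', 100000000), this would hold roughly one chunk of formatted lines in memory at a time instead of the whole file. The best chunk size depends on the machine and is worth measuring.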



Conclusions:




  • Prefer list
    comprehensions to plain for loops where you can (a list comprehension is even faster than
    filter for filtering lists; see
    https://stackoverflow.com/questions/3013449/list-filtering-list-comprehension-vs-lambda-filter).

  • If memory and implementation
    constraints allow, try to create and encode the string contents at once, using the
    format or join
    functions.


  • If possible and the code remains
    readable, use built-in functions to avoid for loops, for
    example the extend method of a list instead of iterating
    and calling append. In fact, both previous points can be seen as
    examples of this
    remark.
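
As a small illustration of the last point, the sketch below builds the same list of lines twice: once with an explicit loop and append, and once with a single extend call. The helper names and the sample data are made up for this example, and no timing result is claimed here.

```python
def lines_with_append(seq_ids):
    """Collect formatted lines with an explicit for loop and append."""
    result = []
    for seq_id in seq_ids:
        result.append('%i\n' % seq_id)
    return result


def lines_with_extend(seq_ids):
    """Collect the same lines with one extend call fed by a generator."""
    result = []
    result.extend('%i\n' % seq_id for seq_id in seq_ids)
    return result


sample_ids = range(5)  # illustrative data only
same_result = lines_with_append(sample_ids) == lines_with_extend(sample_ids)
```

Both helpers return identical lists; the second simply hands the iteration over to the built-in method.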



Remark.
Although
this answer can be considered useful on its own, it does not completely address the
question, namely why the two-loop option in the question seems
to run faster in some environments. For that, perhaps @Aiken Drum's answer below can
shed some light on the matter.
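
To see how the two approaches from the question behave in a particular environment, a small timing harness along these lines may help. It is only a sketch: the function names, the reduced line count and the use of time.perf_counter are assumptions made for this example, not part of the question or of any other answer, and the results will vary with Python version, operating system and disk.

```python
import os
import random
import tempfile
import time


def approach_1(path, n):
    """One loop: format each line and write it immediately."""
    with open(path, 'w') as data_file:
        for seq_id in range(n):
            data_file.write('%i %f\n' % (seq_id, random.random()))


def approach_2(path, n):
    """Two loops: collect all lines in a list, then write them."""
    data_lines = []
    for seq_id in range(n):
        data_lines.append('%i %f\n' % (seq_id, random.random()))
    with open(path, 'w') as data_file:
        for line in data_lines:
            data_file.write(line)


def time_approach(func, n):
    """Run func on a temporary file and return the elapsed seconds."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    try:
        start = time.perf_counter()
        func(path, n)
        return time.perf_counter() - start
    finally:
        os.remove(path)


if __name__ == '__main__':
    n = 1_000_000  # far fewer than 100 million, to keep the run short
    for func in (approach_1, approach_2):
        print('%s: %.3f s' % (func.__name__, time_approach(func, n)))
```

Running each approach several times and in both orders gives a fairer picture, since file system caching can favor whichever one runs second.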

