I've written a script to process CSV files for my employer. The processing is CPU-bound and the files are large: for example, 3+ GB of input yields 6+ GB of output.
On my machine the transformation of that text file takes almost 16 minutes (which itself is kinda long, but I'm using the stock csv module), about 30 seconds of which is spent by the OS (writing the output).
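The transformation itself is just the usual read-transform-write loop over csv.reader and csv.writer; in simplified form (transform_row here is only a stand-in for the real per-row work, which isn't shown):

    import csv

    def transform_row(row):
        # Stand-in for the real, CPU-bound per-row work (not shown here).
        return row

    def process(instream, outstream):
        # Straightforward read-transform-write loop; essentially all of the
        # user time goes into whatever transform_row() does to each record.
        reader = csv.reader(instream)
        writer = csv.writer(outstream)
        for row in reader:
            writer.writerow(transform_row(row))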
For kicks, I added transparent compression as a feature:
    if name.endswith('.gz'):
        import gzip
        return gzip.GzipFile(name, mode, 9, fd)
    return fd
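In context, that branch sits inside a small open-for-writing helper, roughly like this (simplified sketch; the helper name and the plain open() call are illustrative, only the .gz branch above is the actual code):

    import gzip

    def open_output(name, mode='wb'):
        # Open the plain file first, then optionally layer GzipFile on top.
        fd = open(name, mode)
        if name.endswith('.gz'):
            # Compression level 9, the same as running "gzip -9" externally.
            return gzip.GzipFile(name, mode, 9, fd)
        return fd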
When using compression, the run time jumps to over an hour -- although the sys-time halves, because there is a lot less to write.
The jump is understandable, but the scale of it is not: if I simply run gzip -9 on the uncompressed output file, it takes only about 13 minutes.
I can understand that gzip may win something by using bigger buffers, etc., but embedding compression in my script should be able to benefit from less data copying. And yet it loses by worse than 2:1: 16 minutes to transform + 13 minutes to compress vs. 61 minutes to do both in one go.
Why is there such a large discrepancy? Is the zlib/gzip code in Python 2.x known to be slow? Would Python 3 be any better in this regard? It is significantly worse for the uncompressed processing, though...
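To check whether the slowdown really comes from GzipFile's write path rather than from the CSV work, the problem can be boiled down to a write-only benchmark along these lines (file names, chunk size, and total volume are arbitrary placeholders):

    import gzip
    import time

    CHUNK = b'x' * 65536           # 64 KiB of dummy data per write
    REPS = 16 * 1024               # ~1 GiB written in total

    def bench(out):
        # Time nothing but the write() calls on the given file-like object.
        start = time.time()
        for _ in range(REPS):
            out.write(CHUNK)
        out.close()
        return time.time() - start

    print("plain file:  %.1f s" % bench(open('bench.out', 'wb')))
    print("GzipFile -9: %.1f s" % bench(gzip.GzipFile('bench.out.gz', 'wb', 9)))

Dummy data made of a single repeated byte obviously compresses unrealistically well, but the per-call overhead of going through GzipFile.write() should still show up in the comparison.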