
Is Python's gzip.writer known to be slow?


I've written a script to process CSV files for my employer. The processing is CPU-bound and the files are large: for example, 3+ GB of input yields 6+ GB of output.

On my machine the transformation of that text-file takes almost 16 minutes (which itself is kinda long, but I'm using the stock csv-module), about 30 seconds of which is spent in system time (writing the output).

For kicks, I added transparent compression:

if name.endswith('.gz'):
    import gzip
    return gzip.GzipFile(name, mode, 9, fd)
return fd
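For context, here is a self-contained version of that opener (the open_output name and the direct open() call are illustrative; the real script receives an already-open fd from elsewhere):

import gzip

def open_output(name, mode='wb'):
    fd = open(name, mode)
    if name.endswith('.gz'):
        # level 9 matches gzip -9: the slowest, tightest setting
        return gzip.GzipFile(name, mode, 9, fd)
    return fd

out = open_output('out.csv.gz')
out.write(b'col1,col2\n')
out.close()  # note: closing the GzipFile does not close the underlying fd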

When using compression, the run time jumps to over an hour -- although the sys-time halves, because there is a lot less to write.

The jump is understandable, but its scale is not -- if I simply run gzip -9 on the uncompressed output file, that takes only about 13 minutes.

I can understand that gzip may win something by using bigger buffers, etc. -- but embedding compression in my script should benefit from less data-copying. And yet it loses by worse than 2:1: 16 minutes to transform plus 13 minutes to compress, versus 61 minutes to do both in one go.
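If bigger buffers are indeed what the standalone gzip wins on, then batching the many small per-row writes before they reach GzipFile should narrow the gap; a minimal sketch of that idea (the 1 MiB threshold and the dummy rows are arbitrary choices of mine, not from the actual script):

import gzip

out = gzip.GzipFile('out.csv.gz', 'wb', 9)
buf = []
buf_len = 0
for i in range(1000000):
    row = b'field1,field2,field3\n'
    buf.append(row)
    buf_len += len(row)
    if buf_len >= 1024 * 1024:
        # hand the compressor ~1 MiB at a time instead of one row at a time
        out.write(b''.join(buf))
        buf = []
        buf_len = 0
if buf:
    out.write(b''.join(buf))
out.close()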

Why is there such a large discrepancy? Is the zlib/gzip code in Python-2.x known to be slow? Should Python-3 be better in this regard -- it is significantly worse in uncompressed processing...
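One way to isolate the compressor's cost from the CSV transform would be to time raw writes of identical data with and without GzipFile; a rough benchmark sketch (file names, sizes, and the highly compressible dummy data are all illustrative, so treat the numbers as a lower bound):

import gzip
import time

chunk = ('field1,field2,field3\n' * 3000).encode('ascii')  # ~64 KiB
n = 1024  # ~64 MiB total

for label, out in (('plain', open('plain.out', 'wb')),
                   ('gzip-9', gzip.GzipFile('comp.out.gz', 'wb', 9))):
    start = time.time()
    for _ in range(n):
        out.write(chunk)
    out.close()
    print('%s: %.1fs' % (label, time.time() - start))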

