These days I care a bit about zlib deflate speed for work. Last week CloudFlare made some noise about having 'rewritten gzip' for massive performance improvements. That was perhaps a bit exaggerated, since the work turns out to consist mostly of vectorization changes similar to those in the earlier zlib patches from Intel.
The changes from Intel are somewhat more extensive, since they include two completely new deflate strategies ('quick', used at level 1, and 'medium', used at levels 4-6). There's a higher risk of bugs with such changes than with 'just' vectorizing existing scalar code, though those new strategies can be disabled at compile time. Even so, it would've been nice if the later CloudFlare changes had been built on top of Intel's earlier ones. (The most likely explanation is simply not knowing about that work, in which case a single web search could have saved a few days of programming. Intel's changes were pretty well publicized last year and went through the zlib mailing list.)
Anyway, since there are now two separate zlib performance forks, I thought it'd be useful to compare them a bit.
The summary is that CloudFlare's fork is much faster with decompression and a little faster with high compression levels, while Intel's is faster with low and medium compression levels.
It seems likely that one could get the best of both worlds. At least a superficial analysis suggests that many of the changes don't conflict in principle (even if they often conflict in practice, due to messing with the same areas in the code). For example the decompression speedups in CloudFlare's version appear to come mostly from a SSE-optimized CRC32 implementation. Intel's version also includes similar optimized CRC32 code, but as a separate entry point used only for compression (doing a combined memory copy and checksum).
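To illustrate the fused copy-and-checksum idea, here's a sketch in Python. This is not Intel's actual code (which uses PCLMULQDQ-accelerated CRC routines in C); the function name and chunking are my own invention, and `zlib.crc32` stands in for the hardware-assisted checksum:

```python
import zlib

def copy_and_checksum(src: bytes, dst: bytearray, crc: int = 0) -> int:
    """Copy src into dst while updating a running CRC32 in the same pass.

    Fusing the copy and the checksum means each block of data is touched
    once while it is hot in cache, instead of being read twice in two
    separate passes.
    """
    chunk = 64 * 1024
    for off in range(0, len(src), chunk):
        block = src[off:off + chunk]
        dst[off:off + len(block)] = block  # the "memory copy" half
        crc = zlib.crc32(block, crc)       # the checksum half
    return crc & 0xffffffff
```

The running CRC over the chunks matches what a single checksum over the whole buffer would produce, so the fusion changes the memory access pattern but not the result.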
As for production use, the new compression strategies in Intel's version had some early bugs. They are fixed now, and at least in our internal testing we haven't seen any new problems. So we'll probably switch to it in our next release. CloudFlare's changes are more conservative in that sense. Unfortunately there's a license time-bomb in that version, since it now includes a GPLv2 source file. That matters to me, but of course won't be an issue for some.
For more details, continue reading.
The test script checks out and compiles the specified zlib versions, then uses the generated minigzip binaries to compress and decompress a few test files, possibly at different compression levels. When testing decompression, all zlib versions are tested using the same compressed file, generated using the baseline version at the default compression level.
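A minimal sketch of that measurement loop, assuming a built minigzip binary (the real driver is a Perl script; the function name here is hypothetical, but the minigzip flags are real: -c writes to stdout, -1 to -9 picks the level):

```python
import subprocess
import time

def bench_compress(minigzip: str, infile: str, level: int,
                   iters: int = 10) -> float:
    """Time `iters` compressions of infile with the given minigzip binary.

    The compressed output is discarded; only the total wall time matters.
    """
    start = time.monotonic()
    for _ in range(iters):
        subprocess.run([minigzip, f"-{level}", "-c", infile],
                       stdout=subprocess.DEVNULL, check=True)
    return time.monotonic() - start
```

Decompression is timed the same way, with -d -c and the shared baseline-compressed input file.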
Each benchmark run (of 10 compressions or 200 decompressions) was repeated 5 times, with different zlib versions being interleaved to eliminate any systematic biases (e.g. system load or thermal throttling). The numbers reported in the following tables are the mean and the standard error of the execution times.
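The interleaving and the reported statistics amount to something like this (a sketch; the helper names are mine):

```python
import math
import statistics

def interleaved_runs(versions, repeats=5):
    """Order runs so each repeat cycles through every version, spreading
    slow drift (system load, thermal throttling) evenly across versions."""
    return [v for _ in range(repeats) for v in versions]

def mean_and_stderr(times):
    """Mean and standard error of the mean for a list of run times."""
    m = statistics.mean(times)
    se = statistics.stdev(times) / math.sqrt(len(times))
    return m, se
```

With only 5 repeats per version the standard error is a rough estimate, but it's enough to tell apart differences much larger than the run-to-run noise.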
The test corpus included English text in HTML format (The Count of Monte Cristo), an x86-64 executable, a basically incompressible JPEG file, and the pixel data from an RGB image after PNG compression filtering had been applied.
The benchmarks were compiled using gcc 4.8.2 and run on an i7-4770, using the following command line:
CFLAGS='-msse4.2 -mpclmul -O3' perl bench.pl --output-format=json --output-file=results.json --compress-levels=1,3,5,7,9 --compress-iters=10 --decompress-iters=200 --recompile
(Note the CFLAGS if you want to rerun the test. At least on my machine the configure script in CloudFlare's version doesn't autodetect SSE 4.2 / PCLMULQDQ support without those flags.)
Compression level 1
This is essentially a comparison between the original 'fast' deflate strategy and the new 'quick' one, which means the results aren't directly comparable at all. Basically, the Intel version adds a completely new point on the performance vs. compression ratio curve, trading a fairly large loss of compression for an even bigger speedup. It seems like a worthwhile option to have.
|compress executable -1 (10 iterations)|
|compress html -1 (10 iterations)|
|compress jpeg -1 (10 iterations)|
|compress pngpixels -1 (10 iterations)|
Compression level 3
This is very close to a like-for-like comparison, since all library versions are using the same strategy ('fast') with the same parameters. The results aren't identical, but the achieved compression ratios are essentially the same. Intel's version is marginally faster.
|compress executable -3 (10 iterations)|
|compress html -3 (10 iterations)|
|compress jpeg -3 (10 iterations)|
|compress pngpixels -3 (10 iterations)|
Compression level 5
This is a comparison of the new 'medium' strategy to the old 'slow' one. The compression ratios are essentially the same. Intel's version is significantly faster on the compressible data, slower on the incompressible data.
|compress executable -5 (10 iterations)|
|compress html -5 (10 iterations)|
|compress jpeg -5 (10 iterations)|
|compress pngpixels -5 (10 iterations)|
Compression level 7
Another like-for-like comparison, this time with the 'slow' strategy. Both optimized versions are noticeably faster than the original, with CloudFlare's version marginally faster of the two.
|compress executable -7 (10 iterations)|
|compress html -7 (10 iterations)|
|compress jpeg -7 (10 iterations)|
|compress pngpixels -7 (10 iterations)|
Compression level 9
Basically the same as level 7.
|compress executable -9 (10 iterations)|
|compress html -9 (10 iterations)|
|compress jpeg -9 (10 iterations)|
|compress pngpixels -9 (10 iterations)|
Decompression
CloudFlare's version decompresses a lot faster than either of the other two.
|decompress executable (200 iterations)|
|decompress html (200 iterations)|
|decompress jpeg (200 iterations)|
|decompress pngpixels (200 iterations)|