Abstract
I download Wikipedia database dumps on a semi-regular basis in case of transient internet outages.
The enwiki-latest-pages-articles dump of 2014-02-21 arrives as an .xml.bz2 file. Is it worth recompressing it?
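For context, the recompression in question is just a streaming transcode from bzip2 into some other container. A minimal sketch using Python's bz2 and lzma modules (the benchmarks below were run with the standalone tools; the output format and preset here are purely illustrative):

```python
# Stream the .xml.bz2 dump back out as .xml.xz without holding it in memory.
import bz2
import lzma
import shutil

SRC = "enwiki-latest-pages-articles.xml.bz2"
DST = "enwiki-latest-pages-articles.xml.xz"

with bz2.open(SRC, "rb") as src, lzma.open(DST, "wb", preset=9) as dst:
    shutil.copyfileobj(src, dst, length=1024 * 1024)  # 1 MiB chunks
```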
Results
method | size (bytes) | % of original | compression time | decompression time |
---|---|---|---|---|
uncompressed | 47 078 573 765 | 100.0% | - | - |
quicklz | 19 691 063 966 | 41.8% | 13m | |
wikitaxi | 17 131 739 126 | 36.4% | 2h24m | - |
rar-fastest | 13 817 152 272 | 29.3% | 14m | |
xz0 | 12 438 061 032 | 26.4% | 1h11m | |
bz2 | 10 587 333 991 | 22.5% | - | 50 minutes (WinRAR) = 3.3 MiB/sec |
7z-ppmd[1] | 9 182 678 359 | 19.5% | 2h16m | |
xz9 | 8 222 669 352 | 17.5% | 15h27m | |
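The derived figures can be reproduced from the raw sizes; a quick check of the percentage column and of the bz2 decompression rate (compressed bytes consumed over the quoted 50 minutes):

```python
# Recompute the "% of original" column and the bz2 decompression rate
# from the sizes and times in the table above.
UNCOMPRESSED = 47_078_573_765

sizes = {
    "quicklz":     19_691_063_966,
    "wikitaxi":    17_131_739_126,
    "rar-fastest": 13_817_152_272,
    "xz0":         12_438_061_032,
    "bz2":         10_587_333_991,
    "7z-ppmd":      9_182_678_359,
    "xz9":          8_222_669_352,
}

for name, size in sizes.items():
    print(f"{name:12s} {100 * size / UNCOMPRESSED:5.1f}%")

bz2_rate = sizes["bz2"] / (50 * 60) / 2**20
print(f"bz2 decompression: {bz2_rate:.2f} MiB/s")  # ~3.37 MiB/s, i.e. the ~3.3 figure above
```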
Settings
test conditions
- Reading and writing to the same drive (5400 rpm, 500 GB WD Green), drive otherwise unused
- Time measurements to the nearest minute only
- Scientific rigor shares a word with rigor mortis
wikitaxi
- start time 11.43, estimated duration 2h5m
- importer version 1.3.0
- RAM for database: 737MB (default)
7z-ppmd
- 7-Zip 9.22 beta x64
- PPMd compression method
- dictionary 16 MB / word size 6 / solid block size 2 GB / 1 thread
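For reference, a hypothetical reconstruction of these settings as a 7z command line (driven from Python here only for consistency with the other snippets). The switch mapping (PPMd `mem` for the GUI's dictionary size, model order `o` for its word size) is my reading of 7-Zip's documented method parameters, not the exact invocation used, and it assumes 7z is on the PATH.

```python
# Hypothetical 7z invocation matching the GUI settings listed above.
import subprocess

cmd = [
    "7z", "a",
    "-m0=PPMd:mem=16m:o=6",  # PPMd, 16 MB model memory, order ("word size") 6
    "-ms=2g",                # solid block size 2 GB
    "-mmt=1",                # single thread
    "enwiki.7z",
    "enwiki-latest-pages-articles.xml",
]
subprocess.run(cmd, check=True)
```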
quicklz
- qpress64 version 1.1, built against QuickLZ 1.4.1
- 2 threads (T2, default), compression level 1 (L1, default)
- the thread count could be increased, but I suspect it was limited by disk speed; the compression level could be raised to compensate
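A rough check on that suspicion (it applies equally to the rar-fastest run below): total bytes read plus written, divided by the wall-clock time from the results table, lands in the ballpark of what a single 5400 rpm drive can sustain while interleaving reads and writes.

```python
# Back-of-the-envelope combined read+write throughput for the two fast runs.
UNCOMPRESSED = 47_078_573_765

runs = {
    "quicklz":     (19_691_063_966, 13 * 60),  # (compressed bytes, seconds)
    "rar-fastest": (13_817_152_272, 14 * 60),
}

for name, (compressed, seconds) in runs.items():
    mib_per_s = (UNCOMPRESSED + compressed) / seconds / 2**20
    print(f"{name:12s} ~{mib_per_s:.0f} MiB/s read+write")
# quicklz ~82 MiB/s, rar-fastest ~69 MiB/s
```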
rar-fastest
- WinRAR 5.0.0 beta 3 x64 (doing this made me realise it needed an update)
- Classic RAR format, compression level fastest, dictionary size 4096 KB
- once again, probably disk-bottlenecked
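A hypothetical command-line equivalent of these GUI settings; the switch spellings (-ma4 for the classic format, -m1 for fastest, -md for dictionary size) are assumptions and may vary between WinRAR versions.

```python
# Hypothetical rar invocation matching the GUI settings listed above.
import subprocess

cmd = [
    "rar", "a",
    "-ma4",      # classic (RAR4) archive format
    "-m1",       # compression level: fastest
    "-md4096k",  # dictionary size 4096 KB
    "enwiki.rar",
    "enwiki-latest-pages-articles.xml",
]
subprocess.run(cmd, check=True)
```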
xz0
- xz 5.0.5 win64
- compression level 0
- xz is single-threaded only
xz9
- xz 5.0.5 win64
- compression level 9
- xz is single-threaded only
- started 18.41, finished 10.08 the next day
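A quick check that the quoted span matches the 15h27m in the table, plus the implied input throughput of the two xz runs (the calendar date is assumed; only the overnight boundary matters):

```python
# Verify the xz9 wall-clock span and compare xz0/xz9 throughput.
from datetime import datetime

start = datetime(2014, 2, 21, 18, 41)
finish = datetime(2014, 2, 22, 10, 8)
print(finish - start)  # 15:27:00

UNCOMPRESSED = 47_078_573_765
xz0_secs = 1 * 3600 + 11 * 60
xz9_secs = 15 * 3600 + 27 * 60

print(f"xz0: {UNCOMPRESSED / xz0_secs / 2**20:.1f} MiB/s in")      # ~10.5 MiB/s
print(f"xz9: {UNCOMPRESSED / xz9_secs / 2**20:.2f} MiB/s in")      # ~0.81 MiB/s
print(f"xz9 output vs xz0: {8_222_669_352 / 12_438_061_032:.0%}")  # ~66%
```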