Recompressing enwiki

compression benchmark

Abstract

I download Wikipedia database dumps on a semi-regular basis in case of transient internet outages.

The enwiki-latest-pages-articles dump of 2014-02-21 arrives as an .xml.bz2. Is it worth recompressing it?

Results

format        size (bytes)     pct     ctime   dtime
uncompressed  47 078 573 765   100.0%  -       -
quicklz       19 691 063 966   41.8%   13m     -
wikitaxi      17 131 739 126   36.4%   2h24m   -
rar-fastest   13 817 152 272   29.3%   14m     -
xz0           12 438 061 032   26.4%   1h11m   -
bz2           10 587 333 991   22.5%   -       50m (WinRAR, ~3.3 MiB/s)
7z-ppmd[1]     9 182 678 359   19.5%   2h16m   -
xz9            8 222 669 352   17.5%   15h27m  -

(ctime = compression time, dtime = decompression time, "-" = not measured. The bz2 row has no ctime because the dump arrives already bz2-compressed.)
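
The percentages and the bz2 decompression rate follow directly from the byte counts and times above; a quick check in Python (sizes copied from the table):

    # Sanity-check the derived columns using the byte counts from the table.
    UNCOMPRESSED = 47_078_573_765

    sizes = {
        "quicklz": 19_691_063_966,
        "bz2": 10_587_333_991,
        "xz9": 8_222_669_352,
    }

    for name, size in sizes.items():
        print(f"{name}: {size / UNCOMPRESSED:.1%} of original")

    # bz2 decompression: roughly 50 minutes for the ~10.6 GB compressed file,
    # i.e. about 3.4 MiB/s of compressed input, in line with the figure above.
    print(f"bz2: {10_587_333_991 / (50 * 60) / 2**20:.1f} MiB/s")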

Settings

Test conditions

  • Reading and writing to the same drive (WD Green, 500 GB, 5400 rpm), drive otherwise unused
  • Time measurements to the nearest minute only (roughly as sketched below)
  • scientific rigor shares a word with rigor mortis
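
A minimal sketch of one way to time such a run to the nearest minute; the command shown is only a placeholder, not the actual invocation used for any of the results above:

    import subprocess
    import time

    # Placeholder command; substitute the compressor invocation being timed.
    cmd = ["xz", "-0", "-k", "enwiki-latest-pages-articles.xml"]

    start = time.monotonic()
    subprocess.run(cmd, check=True)
    elapsed = time.monotonic() - start

    # Report to the nearest minute, matching the precision used in the table.
    print(f"elapsed: {round(elapsed / 60)} min")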

wikitaxi

  • start time 11:43, estimated duration 2h5m (actual: 2h24m)
  • importer version 1.3.0
  • RAM for database: 737 MB (default)

7z-ppmd

  • 7-Zip 9.22 beta x64
  • PPMd compressor
  • dictionary 16 MB / word size 6 / solid block size 2 GB / 1 thread (a rough CLI equivalent is sketched below)
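
For reference, a rough command-line equivalent of these settings, under the assumption that the dictionary / word size / solid block / thread options map to the documented PPMd mem and o parameters and the -ms / -mmt switches; this is a sketch, not the exact invocation used:

    import subprocess

    # Approximate CLI equivalent of the settings above (an assumption, not
    # the exact invocation used for the benchmark).
    cmd = [
        "7z", "a",
        "-m0=PPMd:mem=16m:o=6",  # PPMd, 16 MB model memory, order ("word size") 6
        "-ms=2g",                # solid block size 2 GB
        "-mmt=1",                # single thread
        "enwiki.7z",
        "enwiki-latest-pages-articles.xml",
    ]
    subprocess.run(cmd, check=True)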

quicklz

  • qpress64 version 1.1, built against QuickLZ 1.4.1
  • T2 (2 threads, default), L1 (compression level 1, default)
  • could have increased the thread count, but I suspect the run was limited by disk speed (rough estimate below); raising the compression level would compensate
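
A back-of-the-envelope check of the disk-bottleneck suspicion: during the 13-minute run the same drive had to read the full 47 GB dump and write the ~20 GB output.

    # Combined read+write throughput during the quicklz run, using the
    # sizes and time from the results table.
    read_bytes = 47_078_573_765    # uncompressed input
    write_bytes = 19_691_063_966   # quicklz output
    seconds = 13 * 60

    print(f"{(read_bytes + write_bytes) / seconds / 2**20:.0f} MiB/s")  # ~82 MiB/s

Roughly 80 MiB/s of interleaved sequential read and write on a single 5400 rpm drive is plausibly close to what the hardware can sustain, which supports the suspicion.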

rar-fastest

  • WinRAR 5.0.0 beta 3 x64 (doing this made me realise it needed an update)
  • Classic RAR format, compression level "fastest", dictionary size 4096 KB
  • once again, probably disk-bottlenecked

xz0

  • xz 5.0.5 win64
  • compression level 0
  • xz is only single-threaded

xz9

  • xz 5.0.5 win64
  • compression level 9 (invocation sketched below)
  • xz is only single-threaded
  • started 18:41, finished 10:08 the next day
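
For reference, both xz runs amount to the standard presets applied to the decompressed XML; a minimal sketch, assuming the dump has already been extracted from the original bz2:

    import subprocess

    # xz 5.0.x compresses single-threaded; -0 and -9 are the presets behind
    # the xz0 and xz9 rows.  -k keeps the input file, -f overwrites any .xz
    # left over from a previous run.
    level = "-9"   # or "-0" for the fast preset
    subprocess.run(
        ["xz", level, "-k", "-f", "enwiki-latest-pages-articles.xml"],
        check=True,
    )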