Jan-Philipp Litza

My offline Wikipedia workflow (25 August 2015)

While looking for services that I could offer in our local Freifunk network, I stumbled across the modern way to use Wikipedia offline: Kiwix.

So, in order to read Wikipedia offline, one only has to download one of the ZIM files provided by the Kiwix project and load it into one of the variants of Kiwix. For Freifunk I set up an instance of the Kiwix server, but I also have the Kiwix app on my Android phone.
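For the server part, a minimal sketch of how such a setup can be started. This assumes the kiwix-serve binary from the Kiwix tools is on the PATH; the port number is an arbitrary choice of mine, not something from the Kiwix docs:

```shell
# Sketch: serving a single ZIM file over HTTP with kiwix-serve.
# The filename matches the dump used later in this post; port is arbitrary.
ZIM=wikipedia_de_all_2015-08.zim
if command -v kiwix-serve >/dev/null 2>&1; then
    kiwix-serve --port=8000 "$ZIM"   # then browse http://localhost:8000/
    started=yes
else
    echo "kiwix-serve not found; install the Kiwix tools first"
    started=no
fi
```

kiwix-serve can also serve several books at once via a library file, which is what the library.xml mentioned in step 5 below is for.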

But to search the offline version for keywords like the online version, and not only by page title, you need a search index. Obviously, you can build the index yourself from the ZIM file, but this takes a looong time. So the same page also offers downloads that bundle the actual content (the ZIM file), the index, and packed versions of Kiwix (which I don’t need).
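For completeness, building the index yourself looks roughly like this. The kiwix-index tool shipped with the Kiwix 0.9 era tools; the argument order (ZIM file first, index directory second) and the paths here are my assumptions, so check kiwix-index's own help before relying on it:

```shell
# Sketch: building a full-text index from a ZIM file with kiwix-index.
# Expect this to run for a very long time on a full Wikipedia dump.
if command -v kiwix-index >/dev/null 2>&1; then
    kiwix-index --verbose wikipedia_de_all_2015-08.zim wikipedia_de_all_2015-08.zim.idx
    built=yes
else
    echo "kiwix-index not installed; downloading the prebuilt index is easier anyway"
    built=no
fi
```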

So my usual workflow whenever there is a new dump available is as follows:

  1. Fetch the zipped index package via torrent (and let it seed until a new one is available)
  2. Extract the ZIM file from it. The ZIP file contains split ZIM files (.zimaa, .zimab, …), but I prefer the big, unified version, and I want to seed the raw ZIM file as well as the packed ZIP file. Luckily, the split ZIM files can simply be concatenated to obtain the unified one. Even better: unzip can do the concatenation while unpacking, saving one step and several gigabytes of disk writes and reads. All I have to do is:

    unzip -p kiwix-0.9+wikipedia_de_all_2015-08.zip data/content/'*' > ../content/wikipedia_de_all_2015-08.zim
  3. Let my torrent program check the consistency of the ZIM file by seeding it with the ZIM-file-only torrent from the Kiwix page (and letting it seed, of course)
  4. Unpack the index:

    unzip kiwix-0.9+wikipedia_de_all_2015-08.zip data/index/'*' -d ../../
  5. Update my library.xml with the one contained in the ZIP file. I use symlinks like wikipedia_de.zim -> wikipedia_de_all_2015-05.zim for the actual ZIM files in order to have permanent URLs in the webserver, and since I have German as well as English Wikipedia dumps, some manual work is needed here.
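Two of the tricks above, glob-order concatenation (step 2) and the symlink switch (step 5), can be checked with throwaway files. Everything here is synthetic; the names only mimic the real dump:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Synthetic split "ZIM" parts: the real .zimaa/.zimab/... suffixes sort
# lexicographically, so a glob expands them in split order.
printf 'AAAA' > content.zimaa
printf 'BBBB' > content.zimab
printf 'CC'   > content.zimac
cat content.zima? > unified.zim
printf 'AAAABBBBCC' > expected.zim
cmp -s unified.zim expected.zim && echo "concatenation OK"

# Symlink switch as in step 5: -f replaces an existing link, -n keeps ln
# from descending into a target that is a directory.
ln -s   expected.zim wikipedia_de.zim   # old link
ln -sfn unified.zim  wikipedia_de.zim   # point it at the new dump
readlink wikipedia_de.zim               # prints: unified.zim
```

The unzip -p trick from step 2 rests on the same property: the split members are stored in the archive in order, so streaming them to stdout yields the unified file byte for byte.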