Project Gutenberg


Project Gutenberg is a project gathering public domain books in different languages; its web site is https://www.gutenberg.org. The purpose of this project is to create a sustainable solution for building a ZIM file that provides the Project Gutenberg ebooks, in a manner similar to the other Kiwix ZIM files.


  • A script (python/perl/nodejs) able to quickly create a ZIM file with all books in all languages.
  • The data should be scraped from the Project Gutenberg web site.
  • The texts should be available in HTML and EPUB.
  • The ZIM should provide a simple filtering/search solution to find content (by author, language, title, ....)

One way to achieve it

  1. Retrieve the list of books, which is published by the Gutenberg project in XML/RDF format.
  2. Parse the XML/RDF and store the data in a structured manner (in memory or in a local DB).
  3. Download the necessary HTML+EPUB data, based on the XML/RDF catalog, into a target directory.
  4. Create the necessary templates for the index web pages (for the search/filter feature, a JavaScript client-side solution should be tried).
  5. Fill the HTML templates with the data from the XML/RDF and write the index pages into the target directory.
  6. Run zimwriterfs to create the corresponding ZIM file from your target directory (a rough skeleton of these steps is sketched just below).
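A rough Python skeleton of these six steps could look like the following. Every helper name here (find_rdf_files, parse_rdf, download_formats, write_index_pages) is a placeholder for code still to be written; two of them are sketched further down.

  # Hypothetical skeleton of the six steps above. The helpers named here do not
  # exist yet; find_rdf_files and parse_rdf are sketched in the sections below,
  # the others are still to be written.
  def build_gutenberg_zim(catalog_dir, target_dir):
      # steps 1-2: enumerate and parse the XML/RDF catalog into structured data
      books = [parse_rdf(path) for path in find_rdf_files(catalog_dir)]
      # step 3: fetch the HTML + EPUB files for each book into target_dir
      for book in books:
          download_formats(book, target_dir)
      # steps 4-5: fill the index page templates and write them into target_dir
      write_index_pages(books, target_dir)
      # step 6: run zimwriterfs on target_dir (see "Building zimwriterfs" below)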

First investigation results

Work done by didier and cniekel.

The RDF index is distributed by Project Gutenberg as a tarball (rdf-files.tar.bz2).

Fetching it with wget works; it contains about 30k directories, each holding a single file with the RDF description of one book.
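As a minimal sketch (assuming the layout described above: one directory per book, one .rdf file per directory), the per-book RDF files can be enumerated like this:

  # Minimal sketch: walk the extracted rdf-files tree (one directory per book,
  # one .rdf file per directory) and yield the path of every book description.
  import os

  def find_rdf_files(root):
      for dirpath, _dirnames, filenames in os.walk(root):
          for name in filenames:
              if name.endswith(".rdf"):
                  yield os.path.join(dirpath, name)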

Emmanuel suggests the scraper should download everything into one directory, then convert the data into an output directory, then zim-ify that directory.

Getting data

Gutenberg supports rsync (rsync -av --del /var/www/gutenberg), so using that to get the data seems simplest. In 2011, the archive was 650 GB. That was the source tree; for the generated data: rsync -av --del /var/www/gutenberg-generated

If I cd gutenberg-generated, there is stuff like:

  ./10126/pg10126.rdf
  ./10126/
  ./10126/pg10126.xapian
  ./10126/pg10126.plucker.pdb
  ./10126/
  ./10126/pg10126.log
  ./10126/pg10126.txt.utf8
  ./10126/pg10126.qrcode.desktop.png
  ./10126/pg10126.epub
  ./10126/pg10126.txt.utf8.gzip
  ./10126/pg10126.converter.log
  ./10126/pg10126.qrcode.png
  ./10126/pg10126.qioo.jar
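Assuming the naming convention shown above (pg<id>.<extension> inside a directory named after the book id), the formats of interest can be located like this; the extension list is only what we need here, not everything Gutenberg generates:

  # Sketch: given the root of the generated tree and a book id, return the paths
  # of the formats we care about (EPUB and UTF-8 text), skipping the ones that
  # are missing for that particular book. The extension list is an assumption.
  import os

  WANTED_EXTENSIONS = ["epub", "txt.utf8"]  # HTML lives in the other rsync tree

  def generated_files(root, book_id):
      found = {}
      for ext in WANTED_EXTENSIONS:
          path = os.path.join(root, str(book_id), "pg{0}.{1}".format(book_id, ext))
          if os.path.exists(path):
              found[ext] = path
      return found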

To get epub+text+html, you'll need both rsync-trees, which seems quite inconvenient.

So a caching fetch-by-URL approach seems more convenient; the RDF file contains a timestamp, which could be compared so that updates to a book are caught.

So an on-disk-caching, robots-obeying URL retriever needs to be made or reused. Being able to somehow filter which books to fetch (language only, book range) would be convenient.
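A minimal sketch of such a retriever, using only Python's standard library. The cache layout (one file per URL, keyed by a hash of the URL) and the user-agent string are assumptions, and the timestamp comparison against the RDF data is left out:

  # Sketch of an on-disk-caching, robots.txt-obeying URL retriever.
  import hashlib
  import os
  import urllib.robotparser
  import urllib.request

  class CachingFetcher:
      def __init__(self, cache_dir, robots_url, user_agent="gutenberg-zim-scraper"):
          self.cache_dir = cache_dir
          self.user_agent = user_agent
          self.robots = urllib.robotparser.RobotFileParser(robots_url)
          self.robots.read()
          os.makedirs(cache_dir, exist_ok=True)

      def fetch(self, url):
          if not self.robots.can_fetch(self.user_agent, url):
              raise RuntimeError("robots.txt forbids fetching " + url)
          cache_path = os.path.join(
              self.cache_dir, hashlib.sha1(url.encode("utf-8")).hexdigest())
          if os.path.exists(cache_path):              # already downloaded
              with open(cache_path, "rb") as f:
                  return f.read()
          request = urllib.request.Request(url, headers={"User-Agent": self.user_agent})
          data = urllib.request.urlopen(request).read()
          with open(cache_path, "wb") as f:           # store for the next run
              f.write(data)
          return data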

Converting data



One of the RDF readers available for Python is rdflib.

The best Goobuntu packaged option seems to be: python-rdflib
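A sketch of pulling basic metadata out of one pg<id>.rdf file with rdflib follows. The namespaces and property names are assumptions based on the Gutenberg catalog format and should be checked against a real file:

  # Sketch: parse one book description and return title, authors and languages.
  from rdflib import Graph, Namespace, RDF

  PGTERMS = Namespace("http://www.gutenberg.org/2009/pgterms/")
  DCTERMS = Namespace("http://purl.org/dc/terms/")

  def parse_rdf(path):
      g = Graph()
      g.parse(path, format="xml")
      book = next(g.subjects(RDF.type, PGTERMS.ebook))    # one ebook per file
      title = g.value(book, DCTERMS.title)
      creators = [g.value(agent, PGTERMS.name)            # author names
                  for agent in g.objects(book, DCTERMS.creator)]
      languages = [g.value(node, RDF.value)               # e.g. "en", "fr"
                   for node in g.objects(book, DCTERMS.language)]
      return {"title": title, "creators": creators, "languages": languages}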

Building zimwriterfs

sudo apt-get install libzim-dev liblzma-dev libmagic-dev autoconf automake
git clone git:// kiwix-other
cd kiwix-other/zimwriterfs/
./autogen.sh
./configure
make
sudo make install

This gives you the zimwriterfs binary.


Driver steps

rsync -av --del /var/www/gutenberg-generated

copy-data-to-outputdir
build-index
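A sketch of wiring these driver steps together from Python. RSYNC_SOURCE is a placeholder (the rsync module URL is not given above), copy_data_to_outputdir and build_index are still to be written, and the zimwriterfs options shown are the usual mandatory ones (check zimwriterfs --help for the exact set):

  # Sketch of the driver: rsync the generated tree, build the output directory,
  # then run zimwriterfs on it.
  import subprocess

  RSYNC_SOURCE = "<gutenberg-generated rsync module>"   # placeholder

  def run_driver(output_dir, zim_path):
      subprocess.check_call(["rsync", "-av", "--del",
                             RSYNC_SOURCE, "/var/www/gutenberg-generated"])
      copy_data_to_outputdir("/var/www/gutenberg-generated", output_dir)
      build_index(output_dir)
      subprocess.check_call(["zimwriterfs",
                             "--welcome=index.html",
                             "--favicon=favicon.png",
                             "--language=eng",
                             "--title=Project Gutenberg",
                             "--description=Public domain ebooks",
                             "--creator=Project Gutenberg",
                             "--publisher=Kiwix",
                             output_dir, zim_path])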

Scripting Stages

  1. Rsync all data to a local mirror directory.
  2. Download & extract rdf-files.tar.bz2.
  3. Loop through the folders/files and parse the RDF.
    1. Fill the database with all data.
  4. Filter the database (a sketch of the fill/filter stage follows below).
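For stages 3.1 and 4, a small SQLite database is one plausible structure; the schema below is an illustrative assumption, not part of the original plan:

  # Sketch: fill a database with all parsed metadata, then filter it.
  import sqlite3

  def create_db(path):
      db = sqlite3.connect(path)
      db.execute("""CREATE TABLE IF NOT EXISTS books (
                        id INTEGER PRIMARY KEY,
                        title TEXT,
                        author TEXT,
                        language TEXT)""")
      return db

  def add_book(db, book_id, title, author, language):
      db.execute("INSERT OR REPLACE INTO books VALUES (?, ?, ?, ?)",
                 (book_id, title, author, language))

  def filter_books(db, language=None):
      if language:
          rows = db.execute("SELECT id, title, author FROM books WHERE language = ?",
                            (language,))
      else:
          rows = db.execute("SELECT id, title, author FROM books")
      return rows.fetchall()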

See also