Project Gutenberg
Project Gutenberg is a project that gathers public domain books in many languages; its website is http://www.gutenberg.org. The purpose of this project is to build a sustainable way of creating a ZIM file that provides the Project Gutenberg ebooks in a manner similar to gutenberg.org.
Goals
- A script (python/perl/nodejs) able to quickly create a ZIM file with all books in all languages.
- The data should be scraped from www.gutenberg.org.
- The texts should be available in HTML and EPUB.
- The ZIM should provide a simple filtering/search solution to find content (by author, language, title, ...).
One way to achieve it
- Retrieve the list of books published by the Gutenberg project in XML/RDF format
- Parse the XML/RDF and store the data in a structured form (in memory or in a local DB)
- Download the necessary HTML+EPUB data from gutenberg.org, based on the XML/RDF catalog, into a target directory
- Create the necessary templates for the index web pages (for the search/filter feature, a client-side JavaScript solution should be tried)
- Fill the HTML templates with the data from the XML/RDF and write the index pages in a target directory
- Run zimwriterfs to create the corresponding ZIM file of your target directory
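The parsing and storage steps in the list above could be sketched as follows. This is only a minimal sketch under stated assumptions, not the project's implementation: the namespace URIs and element names (pgterms:ebook, dcterms:title, and so on) are assumptions about how the per-book RDF files are laid out and should be checked against a real file. SAMPLE_RDF is hand-written illustrative data, and parse_book and load_into_db are hypothetical names.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Namespace URIs assumed for the Gutenberg RDF catalog; verify against a real file.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Hand-written sample imitating one per-book RDF file (illustrative data only).
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/10126">
    <dcterms:title>An Example Title</dcterms:title>
    <dcterms:creator>
      <pgterms:agent>
        <pgterms:name>Doe, Jane</pgterms:name>
      </pgterms:agent>
    </dcterms:creator>
    <dcterms:language>
      <rdf:Description>
        <rdf:value>en</rdf:value>
      </rdf:Description>
    </dcterms:language>
  </pgterms:ebook>
</rdf:RDF>
"""


def parse_book(rdf_text):
    """Extract id, title, author and language from one RDF description."""
    root = ET.fromstring(rdf_text)
    ebook = root.find("pgterms:ebook", NS)
    if ebook is None:
        raise ValueError("no pgterms:ebook element found")
    about = ebook.get("{%s}about" % NS["rdf"])  # e.g. "ebooks/10126"
    return {
        "id": int(about.rsplit("/", 1)[-1]),
        "title": ebook.findtext("dcterms:title", default="", namespaces=NS),
        "author": ebook.findtext(
            "dcterms:creator/pgterms:agent/pgterms:name", default="", namespaces=NS
        ),
        "language": ebook.findtext(
            "dcterms:language/rdf:Description/rdf:value", default="", namespaces=NS
        ),
    }


def load_into_db(books, db_path=":memory:"):
    """Store parsed book records in a local SQLite table and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS book "
        "(id INTEGER PRIMARY KEY, title TEXT, author TEXT, language TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO book VALUES (:id, :title, :author, :language)", books
    )
    conn.commit()
    return conn
```

With the records in SQLite, the index pages for a given language or author can be generated from simple SELECT queries before the templates are filled.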
First investigation results
Work done by didier@google.com and cniekel@google.com
The RDF index is at http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
Downloading it with wget works. The archive contains about 30k directories, each holding one file with the RDF description of one book.
Emmanuel suggests that the scraper should download everything into one directory, then convert the data into an output directory, then zim-ify that directory.
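Once the archive has been downloaded, the per-book RDF files can be read straight out of the tarball without unpacking 30k directories to disk. The following is a minimal sketch using Python's standard tarfile module; iter_rdf_members and list_catalog are hypothetical helper names, and the local filename rdf-files.tar.bz2 is assumed to be where wget saved the file.

```python
import tarfile


def iter_rdf_members(archive, limit=None):
    """Yield (member_name, content_bytes) for each .rdf file in an open tar archive."""
    count = 0
    for member in archive:
        if not (member.isfile() and member.name.endswith(".rdf")):
            continue
        yield member.name, archive.extractfile(member).read()
        count += 1
        if limit is not None and count >= limit:
            break


def list_catalog(path="rdf-files.tar.bz2", limit=5):
    """Print the names and sizes of the first few RDF files in the downloaded archive."""
    with tarfile.open(path, "r:bz2") as archive:
        for name, data in iter_rdf_members(archive, limit=limit):
            print(name, len(data), "bytes")
```

Calling list_catalog() after the download gives a quick check that the archive really holds one RDF file per book directory.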
- Getting data
Gutenberg supports rsync, so using that to get the data seems simplest:
rsync -av --del ftp@ftp.ibiblio.org::gutenberg /var/www/gutenberg
In 2011, the archive was 650GB. That is the source tree; the generated data is fetched with:
rsync -av --del ftp@ftp.ibiblio.org::gutenberg-epub /var/www/gutenberg-generated
If I cd into gutenberg-generated, there is stuff like:
./10126/pg10126.rdf
./10126/pg10126.mobi
./10126/pg10126.xapian
./10126/pg10126.plucker.pdb
./10126/pg10126.qrcode.mobile.png
./10126/pg10126.log
./10126/pg10126.txt.utf8
./10126/pg10126.qrcode.desktop.png
./10126/pg10126.epub
./10126/pg10126.txt.utf8.gzip
./10126/pg10126.converter.log
./10126/pg10126.qrcode.png
./10126/pg10126.qioo.jar
To get EPUB + text + HTML, you need both rsync trees, which seems quite inconvenient.
So a caching fetch-by-URL approach seems more convenient: the RDF file contains a timestamp, which can be compared so that updates to a book are caught.
So an on-disk-caching, robots.txt-obeying URL retriever needs to be made or reused. Being able to filter which books to fetch (by language, by book-number range) would be convenient.
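The retriever described above might look roughly like the sketch below. It is only an illustration under several assumptions: CachingFetcher, wanted and USER_AGENT are hypothetical names, the robots.txt rules are passed in as lines instead of being downloaded, the RDF timestamp is assumed to have already been converted to a POSIX time, and the download function is injectable so the logic can be tried without touching the network. The book URL scheme is not specified in these notes, so none is built in.

```python
import os
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "gutenberg-zim-sketch"  # hypothetical user agent string


class CachingFetcher:
    """Fetch URLs with an on-disk cache, refusing URLs that robots.txt disallows."""

    def __init__(self, cache_dir, robots_lines, opener=None):
        self.cache_dir = cache_dir
        self.robots = urllib.robotparser.RobotFileParser()
        self.robots.parse(robots_lines)
        self.opener = opener or self._download
        os.makedirs(cache_dir, exist_ok=True)

    @staticmethod
    def _download(url):
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            return response.read()

    def _cache_path(self, url):
        # Flatten the URL path into a single file name inside the cache directory.
        name = urlparse(url).path.strip("/").replace("/", "_")
        return os.path.join(self.cache_dir, name)

    def fetch(self, url, modified):
        """Return the content of url; refetch only if `modified` is newer than the cache."""
        if not self.robots.can_fetch(USER_AGENT, url):
            raise PermissionError("disallowed by robots.txt: " + url)
        path = self._cache_path(url)
        if os.path.exists(path) and os.path.getmtime(path) >= modified:
            with open(path, "rb") as fh:
                return fh.read()
        data = self.opener(url)
        with open(path, "wb") as fh:
            fh.write(data)
        # Stamp the cached file with the catalog timestamp for later comparison.
        os.utime(path, (modified, modified))
        return data


def wanted(book, languages=None, id_range=None):
    """Decide whether a book record passes the optional language and id-range filters."""
    if languages is not None and book["language"] not in languages:
        return False
    if id_range is not None and not (id_range[0] <= book["id"] <= id_range[1]):
        return False
    return True
```

Storing the catalog timestamp as the cached file's modification time keeps the cache self-describing, so no separate index of fetched books is needed.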
- Converting data
...
- Zimming
One of the RDF readers for Python is rdflib: https://github.com/RDFLib/rdflib.git
The best Goobuntu packaged option seems to be: python-rdflib
Building zimwriterfs
sudo apt-get install libzim-dev liblzma-dev libmagic-dev autoconf automake
git clone git://git.code.sf.net/p/kiwix/other kiwix-other
cd kiwix-other/zimwriterfs/
./autogen.sh
./configure
make
sudo make install
gives you
/usr/local/bin/zimwriterfs
Driver steps
rsync -av --del ftp@ftp.ibiblio.org::gutenberg-epub /var/www/gutenberg-generated
copy-data-to-outputdir
build-index
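The copy-data-to-outputdir step is not specified further, so the following is only a guess at what it could do: walk the per-book directories of the rsynced gutenberg-generated tree and copy the wanted formats into one flat output directory. The function name mirrors the step name; the default suffix list (EPUB only, since the generated tree listed above contains no HTML) and the flat output layout are assumptions.

```python
import shutil
from pathlib import Path

# Only EPUB appears in the generated tree listing; extend as needed (assumption).
WANTED_SUFFIXES = (".epub",)


def copy_data_to_outputdir(generated_dir, output_dir, suffixes=WANTED_SUFFIXES):
    """Copy files with wanted suffixes from each book directory into output_dir."""
    generated_dir, output_dir = Path(generated_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for book_dir in sorted(p for p in generated_dir.iterdir() if p.is_dir()):
        for src in sorted(book_dir.iterdir()):
            if src.name.endswith(suffixes):
                shutil.copy2(src, output_dir / src.name)
                copied.append(src.name)
    return copied
```

For example, copy_data_to_outputdir("/var/www/gutenberg-generated", "output") would leave files such as pg10126.epub directly in the output directory, ready for the build-index step.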