Difference between revisions of "Project Gutenberg"

Revision as of 15:27, 18 June 2014

Project Gutenberg is a project gathering public domain books in different languages; its web site is http://www.gutenberg.org. The purpose of this project is to find a sustainable way to create a ZIM file that provides the Project Gutenberg ebooks in a manner similar to gutenberg.org.

Goals

A script (python/perl/nodejs) able to quickly create a ZIM file with all books in all languages.
The data should be scraped from www.gutenberg.org.
The texts should be available in HTML and EPUB.
The ZIM should provide a simple filtering/search solution to find content (by author, language, title, ...).

One way to achieve it

Retrieve the list of books published by the Gutenberg project in XML/RDF format
Parse the XML/RDF and store the data in a structured form (memory or local DB)
Download the necessary HTML+EPUB data from gutenberg.org, based on the XML/RDF catalog, into a target directory
Create the necessary templates of the index web pages (For the search/filter feature, a javascript client side solution should be tried)
Fill the HTML templates with the data from the XML/RDF and write the index pages in a target directory
Run zimwriterfs to create the corresponding ZIM file of your target directory
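
The template and index steps above could be sketched roughly as follows. This is only an illustration: the page layout, the output file names (index.html, books.json) and the book fields (id, title, author, language) are assumptions, not a decided design. The JSON file is written so that a client-side JavaScript filter could load it.

```python
import html
import json
from pathlib import Path
from string import Template

# Hypothetical page template; the real layout is still to be designed.
PAGE = Template("""<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>$title</title></head>
<body>
<h1>$title</h1>
<ul>
$items
</ul>
</body>
</html>
""")


def write_index(books, target_dir, title="Project Gutenberg"):
    """Write index.html and books.json into target_dir.

    `books` is a list of dicts with the assumed keys
    id, title, author and language.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Escape every value so that titles containing &, < or > stay valid HTML.
    items = "\n".join(
        '<li><a href="{id}.html">{title}</a> ({author}, {language})</li>'.format(
            id=html.escape(str(b["id"]), quote=True),
            title=html.escape(b["title"]),
            author=html.escape(b["author"]),
            language=html.escape(b["language"]),
        )
        for b in books
    )
    page = PAGE.substitute(title=html.escape(title), items=items)
    (target / "index.html").write_text(page, encoding="utf-8")

    # Same data as JSON, for a JavaScript search/filter running in the browser.
    (target / "books.json").write_text(
        json.dumps(books, ensure_ascii=False), encoding="utf-8"
    )
```

Keeping the catalog in a separate JSON file means the search/filter feature needs no server side, which matters because a ZIM file only serves static content.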

First investigation results

Work done by didier@google.com and cniekel@google.com

The RDF index is at http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2

Fetching it with wget works. The archive contains about 30k directories, each holding a single RDF file that describes one book.
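
A minimal sketch of walking that archive and pulling a few fields out of each RDF file, using only the Python standard library (not rdflib). The element names and namespaces in SAMPLE_RDF are an assumption about the catalog layout and should be checked against a real file; parse_rdf and iter_books are hypothetical helper names.

```python
import tarfile
import xml.etree.ElementTree as ET

# Namespaces assumed to be used by the Gutenberg RDF catalog.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Illustrative record only; not copied from the real catalog.
SAMPLE_RDF = b"""<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/10126">
    <dcterms:title>Sample Title</dcterms:title>
    <dcterms:creator>
      <pgterms:agent><pgterms:name>Doe, Jane</pgterms:name></pgterms:agent>
    </dcterms:creator>
    <dcterms:language>
      <rdf:Description><rdf:value>en</rdf:value></rdf:Description>
    </dcterms:language>
  </pgterms:ebook>
</rdf:RDF>
"""


def parse_rdf(data):
    """Return a dict describing one book, or None if no ebook element is found."""
    root = ET.fromstring(data)
    ebook = root.find("pgterms:ebook", NS)
    if ebook is None:
        return None
    about = ebook.get("{%s}about" % NS["rdf"], "")
    return {
        "id": about.rsplit("/", 1)[-1],
        "title": ebook.findtext("dcterms:title", default="", namespaces=NS),
        "authors": [
            e.text
            for e in ebook.findall("dcterms:creator/pgterms:agent/pgterms:name", NS)
        ],
        "languages": [
            e.text
            for e in ebook.findall("dcterms:language/rdf:Description/rdf:value", NS)
        ],
    }


def iter_books(tar_path):
    """Yield one parsed record per .rdf member of the bzip2-compressed tarball."""
    with tarfile.open(tar_path, "r:bz2") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".rdf"):
                book = parse_rdf(tar.extractfile(member).read())
                if book is not None:
                    yield book
```

Reading members straight from the tarball avoids unpacking about 30k small directories to disk; for example, `list(iter_books("rdf-files.tar.bz2"))` would load the whole catalog into memory.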

Emmanuel suggests the scraper should download everything into one directory, then convert the data into an output directory, then zim-ify that directory.

Getting data

Gutenberg supports rsync ( rsync -av --del ftp@ftp.ibiblio.org::gutenberg /var/www/gutenberg ), so using that to get the data seems simplest. In 2011, the archive was 650GB.

That is the source tree. The generated data is available with:

rsync -av --del ftp@ftp.ibiblio.org::gutenberg-epub /var/www/gutenberg-generated

If I cd gutenberg-generated, there is stuff like:

./10126/pg10126.rdf
./10126/pg10126.mobi
./10126/pg10126.xapian
./10126/pg10126.plucker.pdb
./10126/pg10126.qrcode.mobile.png
./10126/pg10126.log
./10126/pg10126.txt.utf8
./10126/pg10126.qrcode.desktop.png
./10126/pg10126.epub
./10126/pg10126.txt.utf8.gzip
./10126/pg10126.converter.log
./10126/pg10126.qrcode.png
./10126/pg10126.qioo.jar

To get epub+text+html, you'll need both rsync-trees, which seems quite inconvenient.

So a caching fetch-by-URL approach seems more convenient. The RDF file contains a timestamp, which can be compared against the cached copy so that updates to a book are caught.
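
The timestamp comparison could look roughly like this. It is a sketch under assumptions: needs_refresh is a hypothetical helper, the RDF timestamp is assumed to have been extracted already as an ISO 8601 string, and the cached file's modification time stands in for "when we last fetched it".

```python
import os
from datetime import datetime, timezone


def needs_refresh(cached_path, rdf_modified):
    """Return True if the cached file is missing or older than the RDF timestamp.

    `rdf_modified` is an ISO 8601 string such as "2014-06-18T15:27:00+00:00"
    (assumed format); a value without a UTC offset is treated as UTC.
    """
    if not os.path.exists(cached_path):
        return True
    cached = datetime.fromtimestamp(os.path.getmtime(cached_path), tz=timezone.utc)
    remote = datetime.fromisoformat(rdf_modified)
    if remote.tzinfo is None:
        remote = remote.replace(tzinfo=timezone.utc)
    return remote > cached
```

Because the RDF catalog is a single download, this check lets the scraper decide what to refetch without making one request per book.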

So an on-disk-caching, robots.txt-obeying URL retriever needs to be written or reused. Being able to filter which books to fetch (a single language, a range of book IDs) would be convenient.
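
One possible shape for such a retriever, as a sketch rather than a finished tool. The user agent string, the class name CachingFetcher, the helper names make_robots and wanted, and the book dict keys (id, languages) are all assumptions made for illustration. The download function is injectable so the cache and robots logic can be exercised without network access.

```python
import hashlib
import os
import urllib.request
from urllib import robotparser

USER_AGENT = "gutenberg-zim-sketch"  # hypothetical user agent string


def make_robots(lines):
    """Build a robots.txt checker from the lines of an already downloaded file."""
    rp = robotparser.RobotFileParser()
    rp.parse(lines)
    return rp


class CachingFetcher:
    """Fetch URLs once, keep them on disk, and refuse URLs robots.txt disallows."""

    def __init__(self, cache_dir, robots, download=None):
        self.cache_dir = cache_dir
        self.robots = robots
        self.download = download or self._http_get
        os.makedirs(cache_dir, exist_ok=True)

    @staticmethod
    def _http_get(url):
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    def fetch(self, url):
        if not self.robots.can_fetch(USER_AGENT, url):
            raise PermissionError("robots.txt disallows " + url)
        # The cache key is a hash of the URL, so any URL maps to a safe file name.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()
        path = os.path.join(self.cache_dir, name)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
        data = self.download(url)
        with open(path, "wb") as f:
            f.write(data)
        return data


def wanted(book, languages=None, id_range=None):
    """Return True if the book passes the optional language and ID-range filters."""
    if languages is not None and not set(book["languages"]) & set(languages):
        return False
    if id_range is not None and int(book["id"]) not in id_range:
        return False
    return True
```

For example, `wanted(book, languages=["fr"], id_range=range(10000, 11000))` would restrict a run to French books with IDs from 10000 to 10999. This sketch never expires cache entries; combining it with the RDF timestamp check would be the next step.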

Converting data

...

Zimming

One of the RDF readers for Python is rdflib: https://github.com/RDFLib/rdflib.git

The best Goobuntu packaged option seems to be: python-rdflib

Building zimwriterfs

sudo apt-get install libzim-dev liblzma-dev libmagic-dev autoconf automake
git clone git://git.code.sf.net/p/kiwix/other kiwix-other
cd kiwix-other/zimwriterfs/
./autogen.sh
./configure
make
sudo make install

gives you

/usr/local/bin/zimwriterfs

Driver steps

rsync -av --del ftp@ftp.ibiblio.org::gutenberg-epub /var/www/gutenberg-generated

copy-data-to-outputdir
build-index
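
The copy-data-to-outputdir step is not specified further here. A minimal sketch, assuming it only needs to pick the EPUB files out of the rsynced gutenberg-generated tree and place them flat in the output directory (the function name, the flat layout and the suffix filter are assumptions):

```python
import shutil
from pathlib import Path


def copy_data_to_outputdir(generated_dir, output_dir, suffixes=(".epub",)):
    """Copy the wanted pg* files from <generated_dir>/<id>/ into output_dir.

    Returns the sorted list of copied file names.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    suffixes = tuple(suffixes)  # str.endswith needs a tuple, not a list
    copied = []
    for src in sorted(Path(generated_dir).glob("*/pg*")):
        if src.is_file() and src.name.endswith(suffixes):
            shutil.copy2(src, out / src.name)
            copied.append(src.name)
    return copied
```

File names such as pg10126.epub already carry the book ID, so a flat output directory does not produce name clashes between books.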
