Difference between revisions of "Project Gutenberg"

(Created page with "The '''Project Gutenberg''' is a project gathering public domain books in different languages; its web site is http://www.gutenberg.org. The purpose of this project is to creat...")
 
# Fill the HTML templates with the data from the XML/RDF and write the index pages in a target directory
# Run zimwriterfs to create the corresponding ZIM file of your target directory
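The template-filling step could look roughly like the following minimal sketch. The templates, the <code>write_index</code> name and the shape of the book records (dicts with <code>id</code> and <code>title</code>) are all assumptions made for illustration; the real templates are not shown on this page.

```python
import html
import os
from string import Template

# Hypothetical templates; the project's real HTML templates are not shown here.
PAGE = Template("<html><body><h1>$heading</h1><ul>\n$items</ul></body></html>\n")
ITEM = Template('<li><a href="$href">$title</a></li>\n')


def write_index(books, target_dir, heading="Project Gutenberg"):
    """Write target_dir/index.html listing the given books.

    books is an iterable of dicts with 'id' and 'title' keys (assumed shape).
    Returns the path of the written index page.
    """
    os.makedirs(target_dir, exist_ok=True)
    items = "".join(
        ITEM.substitute(
            href="%s.html" % book["id"],
            title=html.escape(book["title"]),  # escape titles taken from the RDF
        )
        for book in books
    )
    path = os.path.join(target_dir, "index.html")
    with open(path, "w", encoding="utf-8") as out:
        out.write(PAGE.substitute(heading=html.escape(heading), items=items))
    return path
```

The resulting target directory is what zimwriterfs would then be pointed at.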
== First investigation results ==
Work done by didier@google.com and cniekel@google.com
The RDF index is at http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
Fetching it with wget works. The archive contains about 30k directories, each holding a single RDF file that describes one book.
Emmanuel suggests that the scraper should download everything into one directory, then convert the data into an output directory, then zim-ify that directory.
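Walking the downloaded archive without unpacking it could be sketched as follows. The member naming pattern (<code>.../<id>/pg<id>.rdf</code>) is an assumption based on the one-file-per-directory layout described above, and <code>iter_rdf_files</code> is a hypothetical helper name.

```python
import re
import tarfile

# Assumed member naming: any/prefix/<id>/pg<id>.rdf
RDF_NAME = re.compile(r"(?:^|/)(\d+)/pg\1\.rdf$")


def iter_rdf_files(tar_path):
    """Yield (book_id, rdf_bytes) for every RDF description in the archive."""
    with tarfile.open(tar_path, "r:bz2") as archive:
        for member in archive:
            match = RDF_NAME.search(member.name)
            if not member.isfile() or match is None:
                continue  # skip directories and anything that is not pg<id>.rdf
            with archive.extractfile(member) as handle:
                yield int(match.group(1)), handle.read()
```

Reading members one at a time avoids creating 30k small files on disk just to parse them.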
;Getting data
Gutenberg supports rsync ( rsync -av --del ftp@ftp.ibiblio.org::gutenberg /var/www/gutenberg ), so using that to get the data seems simplest. In 2011, the archive was 650 GB.
That is the source tree. The generated data can be fetched with:
rsync -av --del ftp@ftp.ibiblio.org::gutenberg-epub /var/www/gutenberg-generated
Inside gutenberg-generated, the files look like this:
./10126/pg10126.rdf
./10126/pg10126.mobi
./10126/pg10126.xapian
./10126/pg10126.plucker.pdb
./10126/pg10126.qrcode.mobile.png
./10126/pg10126.log
./10126/pg10126.txt.utf8
./10126/pg10126.qrcode.desktop.png
./10126/pg10126.epub
./10126/pg10126.txt.utf8.gzip
./10126/pg10126.converter.log
./10126/pg10126.qrcode.png
./10126/pg10126.qioo.jar
To get epub+text+html, you'll need both rsync-trees, which seems quite inconvenient.
So a caching fetch-by-URL seems more convenient. The RDF file contains the timestamp, which can be compared against the cached copy so that updates to a book are caught.
So an on-disk-caching, robots-obeying URL retriever needs to be written or reused. Being able to filter which books to fetch (by language only, or by book range) would be convenient.
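A minimal sketch of such a retriever, under several assumptions: the class and function names are hypothetical, the timestamp is kept in a sidecar <code>.stamp</code> file, the robots.txt text is passed in as a string, and the actual download is an injected callable so that the caching logic can be exercised without network access.

```python
import os
from urllib.robotparser import RobotFileParser


class CachingFetcher:
    """Fetch URLs into cache_dir, re-downloading only when the timestamp changes."""

    def __init__(self, cache_dir, robots_txt, download,
                 user_agent="gutenberg-scraper-sketch"):  # hypothetical agent name
        self.cache_dir = cache_dir
        self.download = download  # injected callable: url -> bytes
        self.user_agent = user_agent
        self.robots = RobotFileParser()
        self.robots.parse(robots_txt.splitlines())
        os.makedirs(cache_dir, exist_ok=True)

    def fetch(self, url, name, timestamp):
        """Return the cached path for url, or None if robots.txt forbids it."""
        if not self.robots.can_fetch(self.user_agent, url):
            return None
        path = os.path.join(self.cache_dir, name)
        stamp_path = path + ".stamp"  # assumed sidecar file holding the RDF timestamp
        cached = None
        if os.path.exists(path) and os.path.exists(stamp_path):
            with open(stamp_path, encoding="utf-8") as handle:
                cached = handle.read()
        if cached != timestamp:  # missing or stale: download again
            data = self.download(url)
            with open(path, "wb") as handle:
                handle.write(data)
            with open(stamp_path, "w", encoding="utf-8") as handle:
                handle.write(timestamp)
        return path


def wanted(book_id, language, languages=None, id_range=None):
    """Filter by language set and/or inclusive (first, last) book-id range."""
    if languages is not None and language not in languages:
        return False
    if id_range is not None and not (id_range[0] <= book_id <= id_range[1]):
        return False
    return True
```

In real use the <code>download</code> callable could be something like <code>lambda url: urllib.request.urlopen(url).read()</code>, and <code>wanted</code> would be applied to each book before calling <code>fetch</code>.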
;Converting data
...
;Zimming
One of the RDF readers for Python is rdflib: https://github.com/RDFLib/rdflib.git
The best Goobuntu packaged option seems to be:
python-rdflib
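To get a feel for what needs to be pulled out of each description, here is a small sketch that uses the standard library's <code>xml.etree.ElementTree</code> instead of rdflib, so that it runs without extra packages. The sample document is made up, and the assumption that the title and timestamps appear as <code>dcterms:title</code> and <code>dcterms:modified</code> elements should be checked against a real Gutenberg RDF file.

```python
import xml.etree.ElementTree as ET

NS = {"dcterms": "http://purl.org/dc/terms/"}

# Made-up sample for illustration; the real Gutenberg layout may differ.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="ebooks/10126">
    <dcterms:title>An Example Title</dcterms:title>
    <dcterms:modified>2013-01-01T00:00:00</dcterms:modified>
  </rdf:Description>
</rdf:RDF>"""


def describe(rdf_xml):
    """Return the first title and the latest 'modified' timestamp found."""
    root = ET.fromstring(rdf_xml)
    title = root.findtext(".//dcterms:title", namespaces=NS)
    stamps = [el.text for el in root.iterfind(".//dcterms:modified", NS) if el.text]
    # String comparison only orders timestamps correctly if they share one format.
    return {"title": title, "modified": max(stamps) if stamps else None}
```

rdflib would be the better fit once more than a couple of fields are needed, since it handles the RDF graph structure rather than the XML surface.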
;Building zimwriterfs
sudo apt-get install libzim-dev liblzma-dev libmagic-dev autoconf automake
git clone git://git.code.sf.net/p/kiwix/other kiwix-other
cd kiwix-other/zimwriterfs/
./autogen.sh
./configure
make
sudo make install
This gives you:
/usr/local/bin/zimwriterfs
;Driver steps
rsync -av --del ftp@ftp.ibiblio.org::gutenberg-epub /var/www/gutenberg-generated
copy-data-to-outputdir
build-index
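The three driver steps above could be chained from Python roughly as follows. This is only a sketch: <code>run_driver</code> is a hypothetical name, and <code>copy_data</code> and <code>build_index</code> are caller-supplied callables standing in for the copy-data-to-outputdir and build-index placeholders, whose real implementations are not described on this page.

```python
import subprocess

# The rsync invocation from the notes above, without the destination directory.
RSYNC_CMD = ["rsync", "-av", "--del", "ftp@ftp.ibiblio.org::gutenberg-epub"]


def run_driver(mirror_dir, output_dir, copy_data, build_index,
               run=subprocess.check_call):
    """Mirror the generated tree, copy the wanted data, then build the index.

    copy_data(mirror_dir, output_dir) and build_index(output_dir) are
    placeholders supplied by the caller; run executes an argument list.
    """
    run(RSYNC_CMD + [mirror_dir])      # step 1: refresh the local mirror
    copy_data(mirror_dir, output_dir)  # step 2: copy-data-to-outputdir
    build_index(output_dir)            # step 3: build-index
```

Passing <code>run</code> as a parameter makes it possible to do a dry run that only records the command instead of contacting the rsync server.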
