Difference between revisions of "Project Gutenberg"

Latest revision as of 12:17, 23 December 2014

Project Gutenberg gathers public domain books in many languages; its web site is http://www.gutenberg.org. The purpose of this project is to create a sustainable way to build a ZIM file that provides the Project Gutenberg ebooks in a manner similar to gutenberg.org.

Goals

A script (python/perl/nodejs) able to quickly create a ZIM file with all books in all languages.
The data should be scraped from www.gutenberg.org.
The texts should be available in HTML and EPUB.
The ZIM should provide a simple filtering/search solution to find content (by author, language, title, ...).

One way to achieve it

Retrieve the list of books published by the Gutenberg project in XML/RDF format
Parse the XML/RDF and put the data in a structured manner (memory or local DB)
Download the necessary HTML+EPUB data from Gutenberg.org based on the XML/RDF Catalog in a target directory
Create the necessary templates for the index web pages (for the search/filter feature, a client-side JavaScript solution should be tried)
Fill the HTML templates with the data from the XML/RDF and write the index pages in a target directory
Run zimwriterfs to create the corresponding ZIM file of your target directory

First investigation results

Work done by didier chez google.com and cniekel chez google.com

The RDF index is at http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2

Downloading it with wget works. The archive contains about 30k directories, each holding a single RDF file that describes one book.
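
As a rough illustration of walking that archive without unpacking it first, here is a minimal sketch. The function name `iter_rdf_members` is hypothetical, and the assumed layout (one directory per book, named after the book ID, holding one `.rdf` file) is inferred from the description above rather than checked against the current archive.

```python
import tarfile
from pathlib import Path


def iter_rdf_members(archive_path):
    """Yield (book_id, member_name) for every .rdf file in the archive.

    Assumed layout: <prefix>/<book_id>/pg<book_id>.rdf
    """
    with tarfile.open(archive_path, mode="r:bz2") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".rdf"):
                # The parent directory name is taken as the book ID.
                book_id = Path(member.name).parent.name
                yield book_id, member.name
```

Reading the members lazily avoids creating 30k directories on disk just to list the catalog.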

Emmanuel suggests the scraper should download everything into one directory, then convert the data into an output directory, then zim-ify that directory.

Getting data

Gutenberg supports rsync, so using it to get the data seems simplest:

rsync -av --del ftp@ftp.ibiblio.org::gutenberg /var/www/gutenberg

In 2011, the archive was 650GB. That tree holds the source files; the generated data lives in a separate tree:

rsync -av --del ftp@ftp.ibiblio.org::gutenberg-epub /var/www/gutenberg-generated

If I cd gutenberg-generated, there is stuff like:

./10126/pg10126.rdf
./10126/pg10126.mobi
./10126/pg10126.xapian
./10126/pg10126.plucker.pdb
./10126/pg10126.qrcode.mobile.png
./10126/pg10126.log
./10126/pg10126.txt.utf8
./10126/pg10126.qrcode.desktop.png
./10126/pg10126.epub
./10126/pg10126.txt.utf8.gzip
./10126/pg10126.converter.log
./10126/pg10126.qrcode.png
./10126/pg10126.qioo.jar

To get epub+text+html, you'll need both rsync-trees, which seems quite inconvenient.

A caching fetch-by-URL approach therefore seems more convenient: the RDF file contains a timestamp, which can be compared so that updates to a book are caught.

So an on-disk-caching, robots.txt-obeying URL retriever needs to be made or reused. Being able to filter which books to fetch (by language, by book-ID range) would also be convenient.
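
One possible shape for such a retriever is sketched below using only the Python standard library. The class name `CachingFetcher`, the `USER_AGENT` string, and the idea of storing the RDF timestamp in a side file next to the cached body are all assumptions made for illustration, not an existing tool.

```python
import hashlib
import urllib.request
import urllib.robotparser
from pathlib import Path
from urllib.parse import urlsplit

USER_AGENT = "gutenberg-zim-sketch"  # hypothetical user agent string


class CachingFetcher:
    """Fetch URLs, honouring robots.txt and caching bodies on disk."""

    def __init__(self, cache_dir, opener=None, robots=None):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self._open = opener or self._http_get
        # Maps host -> RobotFileParser; filled lazily unless supplied.
        self._robots = robots if robots is not None else {}

    def _http_get(self, url):
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    def _allowed(self, url):
        parts = urlsplit(url)
        if parts.netloc not in self._robots:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
            parser.read()
            self._robots[parts.netloc] = parser
        return self._robots[parts.netloc].can_fetch(USER_AGENT, url)

    def fetch(self, url, modified):
        """Return the body of url; download only if `modified` has changed.

        `modified` is the timestamp string taken from the book's RDF file.
        """
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        body_path = self.cache_dir / key
        stamp_path = self.cache_dir / (key + ".stamp")
        if (body_path.exists() and stamp_path.exists()
                and stamp_path.read_text() == modified):
            return body_path.read_bytes()
        if not self._allowed(url):
            raise PermissionError(f"robots.txt disallows {url}")
        data = self._open(url)
        body_path.write_bytes(data)
        stamp_path.write_text(modified)
        return data
```

The `opener` and `robots` parameters exist so the cache logic can be exercised without network access; in normal use both would be left at their defaults.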

Converting data

...

Zimming

One RDF reader for Python is rdflib: https://github.com/RDFLib/rdflib.git

The best Goobuntu packaged option seems to be: python-rdflib
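
To give an idea of what reading one per-book RDF file involves, the sketch below uses the standard library's `xml.etree.ElementTree` instead of rdflib, purely to stay dependency-free. The namespace URIs, the element names (`pgterms:ebook`, `dcterms:title`, `dcterms:language`, `dcterms:creator`) and the function name `parse_book` are assumptions about the catalog format and should be checked against a real file.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces of the Gutenberg RDF catalog.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}


def parse_book(rdf_text):
    """Extract id, title, languages and authors from one RDF document."""
    root = ET.fromstring(rdf_text)
    ebook = root.find("pgterms:ebook", NS)
    if ebook is None:
        return None
    # rdf:about is assumed to look like "ebooks/10126".
    about = ebook.get("{%s}about" % NS["rdf"], "")
    return {
        "id": about.rsplit("/", 1)[-1],
        "title": ebook.findtext("dcterms:title", default="", namespaces=NS),
        "languages": [v.text for v in
                      ebook.findall("dcterms:language//rdf:value", NS)],
        "authors": [n.text for n in
                    ebook.findall("dcterms:creator//pgterms:name", NS)],
    }
```

rdflib would be the better choice if the catalog is to be queried as a graph; a plain XML walk is enough when only a handful of fields are needed.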

Building zimwriterfs

sudo apt-get install libzim-dev liblzma-dev libmagic-dev autoconf automake
git clone git://git.code.sf.net/p/kiwix/other kiwix-other
cd kiwix-other/zimwriterfs/
./autogen.sh
./configure
make
sudo make install

gives you

/usr/local/bin/zimwriterfs

Driver steps

rsync -av --del ftp@ftp.ibiblio.org::gutenberg-epub /var/www/gutenberg-generated

copy-data-to-outputdir
build-index
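
The last driver step, calling zimwriterfs on the output directory, could be wrapped roughly as follows. The option names (`--welcome`, `--favicon`, `--language`, `--title`, `--description`, `--creator`, `--publisher`) are assumptions based on zimwriterfs usage at the time and should be verified with `zimwriterfs --help`; the default metadata values and both function names are hypothetical.

```python
import subprocess


def zimwriterfs_command(html_dir, zim_path, *, welcome="index.html",
                        favicon="favicon.png", language="eng",
                        title="Project Gutenberg",
                        description="Project Gutenberg ebooks",
                        creator="Project Gutenberg", publisher="Kiwix"):
    """Build the argument list for zimwriterfs (flags are assumptions)."""
    return [
        "zimwriterfs",
        f"--welcome={welcome}",
        f"--favicon={favicon}",
        f"--language={language}",
        f"--title={title}",
        f"--description={description}",
        f"--creator={creator}",
        f"--publisher={publisher}",
        str(html_dir),
        str(zim_path),
    ]


def build_zim(html_dir, zim_path, **metadata):
    """Run zimwriterfs and raise CalledProcessError if it fails."""
    subprocess.run(zimwriterfs_command(html_dir, zim_path, **metadata),
                   check=True)
```

Passing the arguments as a list avoids shell quoting problems with titles or descriptions that contain spaces.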

Scripting Stages

[one-time only for mirror] rsync all data to zimfarm.kiwix.org/gutenberg
Download & Extract rdf-files.tar.bz2
Loop through folder/files and parse RDF
1. Fill the Database with all data
Query the database to reflect filters and get list of books
Download the books based on filters (formats, languages)
[MLB] Generate a static folder repository of all ePUB files
Generate zimwriterfs-friendly folder of static HTML files based on templates and list of books.
Generate zim file from static folder
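
The "fill the database" and "query the database to reflect filters" stages might look like the following sketch, which uses SQLite from the standard library. The single-table schema and the function names are hypothetical; a real catalog would likely need separate tables for authors and formats.

```python
import sqlite3


def create_db(path=":memory:"):
    """Open the database and create the (hypothetical) book table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS book ("
        "id INTEGER PRIMARY KEY, title TEXT, author TEXT, language TEXT)"
    )
    return conn


def add_books(conn, books):
    """Insert dicts with the keys id, title, author and language."""
    conn.executemany(
        "INSERT OR REPLACE INTO book (id, title, author, language) "
        "VALUES (:id, :title, :author, :language)",
        books,
    )
    conn.commit()


def select_books(conn, languages=None, id_range=None):
    """Return (id, title) rows matching the optional filters."""
    sql = "SELECT id, title FROM book"
    where, params = [], []
    if languages:
        where.append("language IN (%s)" % ",".join("?" * len(languages)))
        params.extend(languages)
    if id_range:
        where.append("id BETWEEN ? AND ?")
        params.extend(id_range)
    if where:
        sql += " WHERE " + " AND ".join(where)
    return conn.execute(sql + " ORDER BY id", params).fetchall()
```

The list returned by `select_books` is what the download stage would then iterate over.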

Prepare the templates
- Article template
- HomePage template
- Index template?
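
A minimal sketch of filling an index template is shown below, using `string.Template` from the standard library. The template markup, the `<id>.html` link scheme and the function names are placeholders chosen for illustration; the actual templates still have to be designed.

```python
import html
from pathlib import Path
from string import Template

# Placeholder templates; the real ones are yet to be written.
INDEX_TEMPLATE = Template(
    '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
    "<title>$title</title></head>\n"
    "<body><h1>$title</h1>\n<ul>\n$items\n</ul></body></html>\n"
)
ITEM_TEMPLATE = Template('<li><a href="$href">$title</a> ($language)</li>')


def render_index(books, title="Project Gutenberg"):
    """Render the index page for dicts with id, title and language."""
    items = "\n".join(
        ITEM_TEMPLATE.substitute(
            href=html.escape("%s.html" % book["id"], quote=True),
            title=html.escape(book["title"]),
            language=html.escape(book["language"]),
        )
        for book in books
    )
    return INDEX_TEMPLATE.substitute(title=html.escape(title), items=items)


def write_index(books, out_dir):
    """Write index.html into out_dir and return its path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "index.html"
    path.write_text(render_index(books), encoding="utf-8")
    return path
```

Escaping every field matters here because book titles routinely contain characters such as `&` and quotes.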

Next steps

One problem is that even Gutenberg does not have all of the most important books of French literature. We should help to fix this. The coordination page is "TOP 100 French ebooks to create".

@@ Line 3: / Line 3: @@
 == Goals ==
 * A script (python/perl/nodejs) able to create quickly a ZIM file with all books in all languages.
-* The data should be scraped from www.gutemberg.org.
+* The data should be scraped from www.gutenberg.org.
 * The texts should be available in HTML and EPUB.
 * The ZIM should provide a simple filtering/search solution to find content (by author, language, title, ....)
@@ Line 10: / Line 10: @@
 # Retrieve the list of books published by the Gutenberg project in [http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2 XML/RDF format]
 # Parse the XML/RDF and put the data in a structured manner (memory or local DB)
-# Download the necessary HTML+EPUB data from Gutemberg.org based on the XML/RDF Catalog in a target directory
+# Download the necessary HTML+EPUB data from Gutenberg.org based on the XML/RDF Catalog in a target directory
 # Create the necessary templates of the index web pages (For the search/filter feature, a javascript client side solution should be tried)
 # Fill the HTML templates with the data from the XML/RDF and write the index pages in a target directory
 # Run zimwriterfs to create the corresponding ZIM file of your target directory
+== First investigation results ==
+Work done by didier chez google.com and cniekel chez google.com
+The RDF index is at http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
+Downloading it with wget works. The archive contains about 30k directories, each holding a single RDF file that describes one book.
+Emmanuel suggests the scraper should download everything into one directory, then convert the data into an output directory, then zim-ify that directory.
+;Getting data
+Gutenberg supports rsync ( rsync -av --del ftp@ftp.ibiblio.org::gutenberg /var/www/gutenberg ) so using that to get data seems most simple. In 2011, the archive was 650GB.
+That was source, the generated data:
+rsync -av --del ftp@ftp.ibiblio.org::gutenberg-epub /var/www/gutenberg-generated
+If I cd gutenberg-generated, there is stuff like:
+./10126/pg10126.rdf
+./10126/pg10126.mobi
+./10126/pg10126.xapian
+./10126/pg10126.plucker.pdb
+./10126/pg10126.qrcode.mobile.png
+./10126/pg10126.log
+./10126/pg10126.txt.utf8
+./10126/pg10126.qrcode.desktop.png
+./10126/pg10126.epub
+./10126/pg10126.txt.utf8.gzip
+./10126/pg10126.converter.log
+./10126/pg10126.qrcode.png
+./10126/pg10126.qioo.jar
+To get epub+text+html, you'll need both rsync-trees, which seems quite inconvenient.
+A caching fetch-by-URL approach therefore seems more convenient: the RDF file contains a timestamp, which can be compared so that updates to a book are caught.
+So an on-disk-caching, robots.txt-obeying URL retriever needs to be made or reused. Being able to filter which books to fetch (by language, by book-ID range) would also be convenient.
+;Converting data
+...
+;Zimming
+one of the rdflib readers for python: https://github.com/RDFLib/rdflib.git
+The best Goobuntu packaged option seems to be:
+python-rdflib
+Building zimwriterfs
+sudo apt-get install libzim-dev liblzma-dev libmagic-dev autoconf automake
+git clone git://git.code.sf.net/p/kiwix/other kiwix-other
+cd kiwix-other/zimwriterfs/
+./autogen.sh
+./configure
+make
+sudo make install
+gives you
+/usr/local/bin/zimwriterfs
+Driver steps
+rsync -av --del ftp@ftp.ibiblio.org::gutenberg-epub /var/www/gutenberg-generated
+copy-data-to-outputdir
+build-index
+== Scripting Stages ==
+# [one-time only for mirror] rsync all data to zimfarm.kiwix.org/gutenberg
+# Download & Extract rdf-files.tar.bz2
+# Loop through folder/files and parse RDF
+## Fill the Database with all data
+# Query the database to reflect filters and get list of books
+# Download the books based on filters (formats, languages)
+# [MLB] Generate a static folder repository of all ePUB files
+# Generate zimwriterfs-friendly folder of static HTML files based on templates and list of books.
+# Generate zim file from static folder
+* Prepare the templates
+** Article template
+** HomePage template
+** Index template?
+== Next steps ==
+One problem is that even Gutenberg does not have all of the most important books of French literature. We should help to fix this. Here is the [[TOP 100 French ebooks to create|coordination page]].
+== Others ==
+* http://www.ebooksgratuits.com/opds/index.php
+* http://noslivres.net
+* http://fadedpage.com/
+== List of books ==
+* http://www.alalettre.com/fluctuat-10-livres-parfaits.php
+* https://fr.wikipedia.org/wiki/Biblioth%C3%A8que_id%C3%A9ale
+== See also ==
+* [http://lite4.framapad.org/p/hackathon-kiwix-lyon Framapad with some notes]
+* [[Project Gutenberg/description_fr|Grant detail]]
+* [https://github.com/kiwix/gutenberg Github repository]
+* Photos: [[:commons:File:Kiwix hackathon Lyon juillet 2014 extérieur.jpg|outdoor]], [[:commons:File:Kiwix hackathon Lyon juillet 2014 intérieur.jpg|geeking]]