(Created page with "The '''Project Gutenberg''' is a project gathering public domain books in different language, its web site is http://www.gutenberg.org. The purpose of this project is to creat...")
(16 intermediate revisions by 4 users not shown)
Line 3:
== Goals ==
* A script (Python/Perl/Node.js) able to quickly create a ZIM file with all books in all languages.
* The data should be scraped from www.gutenberg.org.
* The texts should be available in HTML and EPUB.
* The ZIM should provide a simple filtering/search solution to find content (by author, language, title, ...).
Line 10:
# Retrieve the list of books published by the Gutenberg project in [http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2 XML/RDF format]
# Parse the XML/RDF and store the data in a structured form (in memory or in a local DB)
# Download the necessary HTML+EPUB data from Gutenberg.org, based on the XML/RDF catalog, into a target directory
# Create the necessary templates for the index web pages (for the search/filter feature, a client-side JavaScript solution should be tried)
# Fill the HTML templates with the data from the XML/RDF and write the index pages to a target directory
# Run zimwriterfs to create the corresponding ZIM file from your target directory
== First investigation results ==
Work done by didier chez google.com and cniekel chez google.com
The RDF index is at http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
Fetching it with wget works. The archive contains 30k directories, each holding a single RDF file that describes one book.
Emmanuel suggests that the scraper should download everything into one directory, then convert the data into an output directory, then zim-ify that directory.
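As an illustration of reading one of the per-book RDF files described above, here is a minimal sketch using only Python's standard library (xml.etree.ElementTree) rather than rdflib. The namespace URIs, the element layout and the sample record are assumptions made for this example; a real RDF file should be inspected before relying on them. The function name parse_book is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for the per-book RDF files; verify against a real file.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Hypothetical, heavily trimmed stand-in for one per-book RDF file.
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/10126">
    <dcterms:title>An Example Title</dcterms:title>
    <dcterms:language>
      <rdf:Description><rdf:value>en</rdf:value></rdf:Description>
    </dcterms:language>
    <dcterms:creator>
      <pgterms:agent><pgterms:name>Doe, Jane</pgterms:name></pgterms:agent>
    </dcterms:creator>
  </pgterms:ebook>
</rdf:RDF>
"""


def parse_book(rdf_text):
    """Return a dict with the id, title, languages and authors of one book."""
    root = ET.fromstring(rdf_text.strip())
    ebook = root.find("pgterms:ebook", NS)
    if ebook is None:
        raise ValueError("no pgterms:ebook element found")
    # The book number is the last path segment of the rdf:about attribute.
    about = ebook.get("{%s}about" % NS["rdf"], "")
    return {
        "id": int(about.rsplit("/", 1)[-1]),
        "title": ebook.findtext("dcterms:title", default="", namespaces=NS),
        "languages": [
            v.text
            for v in ebook.findall(
                "dcterms:language/rdf:Description/rdf:value", NS
            )
        ],
        "authors": [
            n.text
            for n in ebook.findall(
                "dcterms:creator/pgterms:agent/pgterms:name", NS
            )
        ],
    }


if __name__ == "__main__":
    print(parse_book(SAMPLE_RDF))
```

The resulting dicts can be kept in memory or written to a local database, whichever the scraper ends up using.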
;Getting data
Gutenberg supports rsync ( rsync -av --del ftp@ftp.ibiblio.org::gutenberg /var/www/gutenberg ), so using that to get the data seems simplest. In 2011, the archive was 650GB.
That is the source tree. The generated data is fetched with:
rsync -av --del ftp@ftp.ibiblio.org::gutenberg-epub /var/www/gutenberg-generated
Inside gutenberg-generated, there are files like:
./10126/pg10126.rdf
./10126/pg10126.mobi
./10126/pg10126.xapian
./10126/pg10126.plucker.pdb
./10126/pg10126.qrcode.mobile.png
./10126/pg10126.log
./10126/pg10126.txt.utf8
./10126/pg10126.qrcode.desktop.png
./10126/pg10126.epub
./10126/pg10126.txt.utf8.gzip
./10126/pg10126.converter.log
./10126/pg10126.qrcode.png
./10126/pg10126.qioo.jar
To get epub+text+html, you'll need both rsync trees, which seems quite inconvenient.
So a caching fetch-by-URL approach seems more convenient. The RDF file contains a timestamp that can be compared against the cached copy, so updates to a book will be caught.
So an on-disk-caching, robots-obeying URL retriever needs to be made or reused. It would also be convenient to be able to filter which books to fetch (by language, by book range).
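The retriever described above could look roughly like the following sketch. It is only an outline under assumptions: the user agent string, the cache layout (one folder per book number) and all function names are hypothetical, and the RDF timestamp is assumed to be already parsed into a timezone-aware datetime. The robots check uses Python's standard urllib.robotparser.

```python
import os
import urllib.request
import urllib.robotparser
from datetime import datetime, timezone

USER_AGENT = "gutenberg-zim-sketch"  # hypothetical user agent name


def cache_path(cache_dir, book_id, extension):
    """Hypothetical cache layout: <cache_dir>/<id>/pg<id>.<extension>."""
    return os.path.join(cache_dir, str(book_id), "pg%d.%s" % (book_id, extension))


def is_stale(path, rdf_modified):
    """True if there is no cached copy or it is older than the RDF timestamp."""
    if not os.path.exists(path):
        return True
    cached = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return cached < rdf_modified


def wanted(book, languages=None, id_range=None):
    """Filter by language and/or inclusive book-number range."""
    if languages is not None and not set(book["languages"]) & set(languages):
        return False
    if id_range is not None and not id_range[0] <= book["id"] <= id_range[1]:
        return False
    return True


def fetch(url, path, rdf_modified, robots, opener=urllib.request.urlopen):
    """Download url to path unless the cache is fresh or robots.txt forbids it."""
    if not is_stale(path, rdf_modified):
        return "cached"
    if not robots.can_fetch(USER_AGENT, url):
        return "disallowed"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with opener(url) as response, open(path, "wb") as out:
        out.write(response.read())
    return "downloaded"
```

The opener parameter is there so the download call can be swapped out (for tests, or for a rate-limited client); by default it is urllib.request.urlopen. The robots argument is expected to be a urllib.robotparser.RobotFileParser that has already read the site's robots.txt.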
;Converting data
...
;Zimming
One of the RDF readers for Python is rdflib: https://github.com/RDFLib/rdflib.git
The best Goobuntu packaged option seems to be:
python-rdflib
Building zimwriterfs
sudo apt-get install libzim-dev liblzma-dev libmagic-dev autoconf automake
git clone git://git.code.sf.net/p/kiwix/other kiwix-other
cd kiwix-other/zimwriterfs/
./autogen.sh
./configure
make
sudo make install
gives you
/usr/local/bin/zimwriterfs
Driver steps
rsync -av --del ftp@ftp.ibiblio.org::gutenberg-epub /var/www/gutenberg-generated
copy-data-to-outputdir
build-index
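The driver steps above could be chained from a small Python script, sketched below. The rsync command is copied from the notes; copy-data-to-outputdir and build-index are the placeholder names used above, not existing programs, and the function names are hypothetical. The sketch defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

RSYNC_SOURCE = "ftp@ftp.ibiblio.org::gutenberg-epub"
MIRROR_DIR = "/var/www/gutenberg-generated"


def build_steps(mirror_dir=MIRROR_DIR):
    """Return the driver steps as argument lists, in execution order."""
    return [
        ["rsync", "-av", "--del", RSYNC_SOURCE, mirror_dir],
        ["copy-data-to-outputdir"],  # placeholder name taken from the notes
        ["build-index"],             # placeholder name taken from the notes
    ]


def run(steps, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    rendered = []
    for cmd in steps:
        line = "+ " + " ".join(shlex.quote(arg) for arg in cmd)
        print(line)
        rendered.append(line)
        if not dry_run:
            # check=True stops the driver at the first failing step.
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    run(build_steps())
```

Calling run(build_steps(), dry_run=False) would actually execute the steps, which only makes sense once the two placeholder commands exist.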
== Scripting Stages ==
# [one-time only for mirror] rsync all data to zimfarm.kiwix.org/gutenberg
# Download & extract rdf-files.tar.bz2
# Loop through folders/files and parse the RDF
## Fill the database with all data
# Query the database to apply filters and get the list of books
# Download the books based on filters (formats, languages)
# [MLB] Generate a static folder repository of all EPUB files
# Generate a zimwriterfs-friendly folder of static HTML files based on templates and the list of books
# Generate the ZIM file from the static folder
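The "fill the database" and "query the database" stages above can be sketched with Python's built-in sqlite3 module. The schema, table names and function names below are assumptions chosen for illustration; the notes do not specify a database or a schema.

```python
import sqlite3


def create_db(path=":memory:"):
    """Create a hypothetical two-table schema: books and their formats."""
    db = sqlite3.connect(path)
    db.executescript(
        """
        CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, language TEXT);
        CREATE TABLE book_format (
            book_id INTEGER REFERENCES book(id),
            format TEXT
        );
        """
    )
    return db


def add_book(db, book_id, title, language, formats):
    """Insert one book and the formats it is available in."""
    db.execute("INSERT INTO book VALUES (?, ?, ?)", (book_id, title, language))
    db.executemany(
        "INSERT INTO book_format VALUES (?, ?)",
        [(book_id, fmt) for fmt in formats],
    )


def select_books(db, languages, formats):
    """Return (id, title) of books matching any given language and format."""
    lang_marks = ",".join("?" * len(languages))
    fmt_marks = ",".join("?" * len(formats))
    sql = (
        "SELECT DISTINCT b.id, b.title FROM book b "
        "JOIN book_format f ON f.book_id = b.id "
        "WHERE b.language IN (%s) AND f.format IN (%s) "
        "ORDER BY b.id" % (lang_marks, fmt_marks)
    )
    return db.execute(sql, list(languages) + list(formats)).fetchall()


if __name__ == "__main__":
    db = create_db()
    add_book(db, 1, "An English Book", "en", ["html", "epub"])
    add_book(db, 2, "Un livre", "fr", ["epub"])
    print(select_books(db, ["fr"], ["epub"]))
```

The list returned by select_books is what the download stage would iterate over. Only the placeholders are built with string formatting; the values themselves are passed as query parameters.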
* Prepare the templates
** Article template
** HomePage template
** Index template?
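As a rough idea of what the article and index templates listed above might look like, here is a sketch using string.Template from the Python standard library. The HTML markup, the file naming (<id>.html, <id>.epub) and the function names are hypothetical; a real implementation may well use a dedicated template engine instead.

```python
from html import escape
from string import Template

# Hypothetical article template: one static HTML page per book.
ARTICLE = Template(
    "<html><head><title>$title</title></head>"
    '<body><h1>$title</h1><p>$author</p><a href="$epub">EPUB</a></body></html>'
)

# Hypothetical index template: a plain list linking to the article pages.
INDEX = Template("<html><body><ul>\n$items\n</ul></body></html>")


def render_article(book):
    """Fill the article template, escaping text taken from the catalog."""
    return ARTICLE.substitute(
        title=escape(book["title"]),
        author=escape(book["author"]),
        epub="%d.epub" % book["id"],
    )


def render_index(books):
    """Fill the index template with one list item per book."""
    items = "\n".join(
        '<li><a href="%d.html">%s</a></li>' % (b["id"], escape(b["title"]))
        for b in books
    )
    return INDEX.substitute(items=items)


if __name__ == "__main__":
    books = [{"id": 10126, "title": "A & B", "author": "Doe, Jane"}]
    print(render_article(books[0]))
    print(render_index(books))
```

The rendered strings would be written into the static folder that zimwriterfs later turns into the ZIM file.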
== Next steps ==
One of the problems is that even on Gutenberg, we don't have all the most important books of French literature. We should help fix this. Here is the [[TOP 100 French ebooks to create|coordination page]].
== Others ==
* http://www.ebooksgratuits.com/opds/index.php
* http://noslivres.net
* http://fadedpage.com/
== List of books ==
* http://www.alalettre.com/fluctuat-10-livres-parfaits.php
* https://fr.wikipedia.org/wiki/Biblioth%C3%A8que_id%C3%A9ale
== See also ==
* [http://lite4.framapad.org/p/hackathon-kiwix-lyon Framapad with some notes]
* [[Project Gutenberg/description_fr|Grant detail]]
* [https://github.com/kiwix/gutenberg Github repository]
* Photos: [[:commons:File:Kiwix hackathon Lyon juillet 2014 extérieur.jpg|outdoor]], [[:commons:File:Kiwix hackathon Lyon juillet 2014 intérieur.jpg|geeking]]