Difference between revisions of "Mediawiki DumpHTML extension improvement"

(12 intermediate revisions by 3 users not shown)
Line 1: Line 1:
The '''Mediawiki DumpHTML extension improvement''' is an effort that needs a grant in order to provide an efficient and handy solution for making ZIM files from Mediawiki. With the solution we propose to develop, everyone from the Mediawiki administrator to the ordinary user will be able to generate high-quality ZIM files, small or large.
:'''''This page is outdated! DumpHTML is no longer being worked on.'''''
The '''Mediawiki DumpHTML extension improvement''' was an effort that needed a grant in order to provide an efficient and handy solution for making ZIM files from Mediawiki. With the solution we proposed to develop, everyone from the Mediawiki administrator to the ordinary user would be able to generate high-quality ZIM files, small or large.


== Context ==
The [http://www.openzim.org ZIM] format was chosen by the top actors around Mediawiki to provide offline-usable versions of the content. [http://www.kiwix.org Kiwix] is the reference reader supporting it, but ZIM is an open format, and there are also [http://www.openzim.org/ZIM_Readers other readers].


The ZIM format was designed to deal efficiently with huge amounts of data, which means that you can handle millions of pictures and texts extremely quickly, even on a small device like a smartphone. This format is complementary to the https://secure.wikimedia.org/wikipedia/en/wiki/EPUB EPUB], which is intended more for small content and does not scale.
The ZIM format was designed to deal efficiently with huge amounts of data, which means that you can handle millions of pictures and texts extremely quickly, even on a small device like a smartphone. This format is complementary to [[Wikipedia:EPUB|EPUB]], which is intended more for small content and does not scale.


Unfortunately, the development of the format is held back by a lack of software. We suffer from a lack of tools to build ZIM files, and currently only a few people have the necessary know-how and software solutions to do it:
Line 13: Line 14:
* Not maintained, bugs are not fixed, new features are not implemented
* Only available to Mediawiki system administrators
* Does not generate ZIM files
* Does not generate ZIM files (only static HTML and Media files)


== Challenges ==
Consequently, almost nobody uses it right now to generate ZIM files: it is too complicated and buggy. This is despite the fact that many people want to do so and contact the Kiwix development team for help making a ZIM file of their own content. External projects are not the only ones that would benefit from such a development: we would also gain a lot in efficiency, and this would be the first necessary step towards preparing ZIM files automatically.
Consequently, almost nobody uses it right now to generate ZIM files: it is too complicated and buggy. This is despite the fact that many people want to do so and contact the Kiwix development team for help making a ZIM file of their own content.
 
With this project we want to offer:
* a way for everyone to generate ZIM files with Mediawiki
* the simplest solution for end users to use
* the simplest solution for Mediawiki administrators to use and deploy
* the solution offering the best HTML rendering quality
* the best-designed solution, simple to maintain and easy to improve (for example, to add support for new HTML-based formats, like EPUB)
* the best-performing solution in terms of generation speed


== Workpackages ==
=== Workpackage1: Revamping and fixing bugs ===
 
=== 1 - phpzim creation ===
 
This workpackage covers the creation of a ZIM [devzone.zend.com/303/extension-writing-part-i-introduction-to-php-and-zend/ PHP extension] called ''[[phpzim]]''. phpzim is an extension that allows PHP developers to read and write ZIM files. It is based on the [http://www.openzim.org/Zimlib zimlib], like [https://github.com/pediapress/pyzim pyzim], the Python extension for dealing with ZIM files. phpzim is needed to:
* speed up ZIM creation (avoiding the use of a PostgreSQL database and the [http://www.openzim.org/Zimwriter zimwriter] binary)
* integrate ZIM generation directly into DumpHTML
* allow the many CMSs written in PHP to generate ZIM files as well
 
Deliverables:
* Create a tgz of the zimlib containing only what is necessary for phpzim
* Write the code (C++) of the phpzim PHP extension, using the GNU tools for compilation
* phpzim should offer an easy API to read/write ZIM files with all the necessary options
* The phpzim code should be developed online in the openZIM repository and be available as a directly compilable tgz
* Code usage should be documented, and the documentation should be generated automatically using doxygen or similar
* Rewrite and improve [http://sourceforge.net/p/kiwix/kiwix/ci/HEAD/tree/dumping_tools/scripts/buildZimFileFromDirectory.pl] in PHP (dealing directly with the zimlib)
 
Costs:
* ~ 4000 euros
 
{{rellink|More details: [[phpzim]]}}
 
=== 2 - Revamping and fixing bugs ===


The worst point is that the DumpHTML extension is not properly maintained and, over time, [https://bugzilla.wikimedia.org/buglist.cgi?query_format=advanced&list_id=2671&component=DumpHTML&resolution=---&product=MediaWiki%20extensions many issues were discovered]. Currently, the extension is not really usable without fixing/tweaking the Mediawiki code.
Line 33: Line 63:
* ~ 4000 euros


=== Workpackage2: phpzim creation and integration in DumpHTML extension ===
=== 3 - phpzim Integration ===


phpzim would be a new PHP module that allows creating, writing, and reading ZIM files directly in PHP. This would be a binding of the zimlib, like pyzim in Python. Once this library is done, we will be able to create ZIM files directly from DumpHTML.
phpzim would be a new [http://pecl.php.net/ PHP extension] that allows creating, writing, and reading ZIM files directly in PHP. This would be a binding of the [http://www.openzim.org/Zimlib zimlib], like [https://code.launchpad.net/zim/pyzim pyzim] in Python. Once this library is done, we will be able to create ZIM files directly from DumpHTML.


To get a ZIM file, the user will have to call dumpHTML.php and specify that they want ZIM file output (not an HTML dump), along with some other metadata such as title, creator, etc.
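As a rough illustration of the interface described above, the following Python sketch models how an output format and its metadata could be passed on a command line. The program name and all option names (<code>--format</code>, <code>--output</code>, <code>--title</code>, <code>--creator</code>) are hypothetical; they are not options of the existing dumpHTML.php script, and the rule that ZIM output requires a title and a creator is an assumption made for the example.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical dump command (NOT the real dumpHTML.php CLI)."""
    parser = argparse.ArgumentParser(prog="dump-sketch")
    # Hypothetical option: choose between a static HTML dump and a ZIM file.
    parser.add_argument("--format", choices=["html", "zim"], default="html")
    # Hypothetical option: destination directory (HTML) or file (ZIM).
    parser.add_argument("--output", required=True)
    # Hypothetical metadata options, only meaningful for ZIM output.
    parser.add_argument("--title")
    parser.add_argument("--creator")
    return parser


def parse_dump_options(argv):
    """Parse the options and check that ZIM output comes with its metadata."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.format == "zim":
        # Assumption for this sketch: title and creator are mandatory for ZIM.
        missing = [name for name in ("title", "creator") if getattr(args, name) is None]
        if missing:
            parser.error("ZIM output needs: " + ", ".join("--" + name for name in missing))
    return args
```

Calling <code>parse_dump_options(["--format", "zim", "--output", "wiki.zim", "--title", "My wiki", "--creator", "Me"])</code> returns the parsed options, while omitting the metadata for ZIM output makes the parser exit with an error message.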
Line 42: Line 72:


Deliverables:
* phpzim (40 hours)
* updated dumpHTML (40 hours)
* updated dumpHTML (20 hours)


Costs:
* ~ 2500 euros
* ~ 1500 euros


=== Workpackage3: Integrating Collection and DumpHTML extensions and new features ===
=== 4 - Integrating Collection and DumpHTML extensions and new features ===


By integrating the DumpHTML and Collection extensions, we want to give everyone the ability to easily create small ZIM files from the Wikipedia user interface, with the following advantages:
Line 58: Line 87:


Deliverables:
* Book and DumpHTML integration (30 hours)
* Collection extension and DumpHTML integration (30 hours)
* DumpHTML parallel processing (15 hours)
* Build selection based on list of titles (20 hours)
* Create an offline skin for mobiles to make dumps for mobiles (15 hours)
* Use new rendering for mobile to make ZIM files for mobiles (15 hours)
* Make offline skin to avoid pictures (5 hours)
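
The "selection based on list of titles" and "parallel processing" deliverables could fit together roughly as in the Python sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, the round-robin split is one possible strategy, and nothing here reflects how DumpHTML is or would be implemented.

```python
def read_title_list(lines):
    """Normalise a list of page titles: trim whitespace, skip blanks and duplicates.

    Underscores are replaced by spaces, since Mediawiki treats both forms
    of a title as equivalent.
    """
    seen = set()
    titles = []
    for line in lines:
        title = line.strip().replace("_", " ")
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


def slice_for_worker(titles, worker_index, worker_count):
    """Return the titles one worker should process (hypothetical round-robin split)."""
    if worker_count < 1 or not 0 <= worker_index < worker_count:
        raise ValueError("worker_index must be in range(worker_count)")
    # Every worker_count-th title, starting at this worker's index.
    return titles[worker_index::worker_count]
```

Each worker process would then dump only its own slice, for example <code>slice_for_worker(titles, 0, 2)</code> and <code>slice_for_worker(titles, 1, 2)</code> for two workers, so that no page is rendered twice.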


Line 72: Line 101:


If you want to know more:
* Kiwix presentation document (in French): http://www.kiwix.org/images/6/6f/Kiwix_presentation_fr.pdf
* Kiwix presentation document (in French): [[:File:Kiwix_presentation_fr.pdf]]
* Kiwix official Web site: http://www.kiwix.org
