MediaWiki DumpHTML extension improvement

From Kiwix
The MediaWiki DumpHTML extension improvement is a project that needs a grant in order to provide an efficient and convenient way to make ZIM files from MediaWiki. With the solution we propose to develop, everyone from the MediaWiki administrator to the ordinary user will be able to generate high-quality ZIM files, small or large.

Context

The ZIM format was chosen by the main actors around MediaWiki to provide offline-usable versions of their content. Kiwix is the reference reader supporting it, but ZIM is an open format and there are other readers as well.

The ZIM format was designed to deal efficiently with huge amounts of data: you can handle millions of pictures and texts extremely quickly, even on a small device like a smartphone. This format is complementary to EPUB (https://secure.wikimedia.org/wikipedia/en/wiki/EPUB), which is intended more for small content and does not scale.

Unfortunately, the development of the format is held back by a lack of software. We lack tools to build ZIM files, and currently only a few people have the necessary know-how and software to do it:

  • Kiwix, which uses a solution based on a hacked version of the MediaWiki DumpHTML extension, with additional custom scripts. This is currently the only project generating (and more or less able to generate) big ZIM files from WMF projects. You may download these ZIM files here. This solution is currently not usable at all by people outside the project.
  • The MediaWiki Collection extension developed by Pediapress (which is deployed on Wikipedia) is user-friendly but suffers from many issues: (1) really complicated to install on a separate instance; (2) slow and unable to deal with huge amounts of data; (3) rendering quality far from that of the online version; (4) its technical approach does not allow tuning the content rendering for offline usage.

Our years of experience have shown us that the MediaWiki DumpHTML extension (or at least its approach) is the best solution for exporting MediaWiki's dynamically generated HTML to an offline-usable format. Unfortunately, there are a few pain points:

  • Not maintained, bugs are not fixed, new features are not implemented
  • Only available to MediaWiki system administrators
  • Does not generate ZIM files

Challenges

Consequently, almost nobody uses it right now to generate ZIM files: it is too complicated and buggy. Yet many people want to do so and contact the Kiwix development team for help making a ZIM of their own content.

With this project we want to offer:

  • A way for everyone to generate ZIM files with MediaWiki
  • the simplest solution to use for end users
  • the simplest solution to use and deploy for MediaWiki administrators
  • the solution offering the best HTML rendering quality
  • the best-designed solution, simple to maintain and easy to improve
  • the most performant solution in terms of generation speed

Workpackages

1 - Revamping and fixing bugs

The worst point is that the DumpHTML extension is not properly maintained, and over time many issues have been discovered. Currently, the extension is not really usable without fixing or tweaking the MediaWiki code.

The purpose of this work package is to fix the most critical bugs, so that anyone running a MediaWiki can simply get an HTML dump of their content and then easily generate a ZIM file from it.

After the revamp, the HTML output should be flawless, similar to what dumpHTML.pl produces.

Deliverables:

  • Revamping dumpHTML and fixing bugs (80-120 hours)

Costs:

  • ~ 4000 euros

2 - phpzim creation and integration in the DumpHTML extension

phpzim would be a new PHP extension for creating, writing and reading ZIM files directly in PHP. It would be a binding of zimlib, like pyzim in Python. With this library in place, we will be able to create ZIM files directly from DumpHTML.

To get a ZIM file, the user will have to call dumpHTML.php and specify that they want ZIM output (not an HTML dump), along with some metadata such as title, creator, etc.

So the user will need system access to the machine where the MediaWiki instance runs and will also need to install phpzim (which should be packaged).

Deliverables:

  • phpzim (40 hours)
  • updated dumpHTML (20 hours)

Costs:

  • ~ 2500 euros

3 - Integrating Collection and DumpHTML extensions and new features

By integrating the DumpHTML and Collection extensions, we want to give everyone the ability to easily create small ZIM files from the Wikipedia user interface, with the following advantages:

  • exactly the same rendering as online
  • no external dependency to install for the MediaWiki administrator
  • rendering done by MediaWiki (as fast as online browsing)

In addition we want to implement a few additional features (see the list of deliverables).

Deliverables:

  • Book and DumpHTML integration (30 hours)
  • DumpHTML parallel processing (15 hours)
  • Build selection based on list of titles (20 hours)
  • Create an offline skin for mobile devices, to make mobile-oriented dumps (15 hours)
  • Make an offline skin that omits pictures (5 hours)
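
To illustrate how the "list of titles" selection and the parallel processing deliverables could fit together, here is a minimal sketch in Python. It is not the DumpHTML implementation: the title-list format (one title per line, `#` for comments), the interleaved slicing strategy and all function names (`read_titles`, `split_into_slices`, `render_page`, `dump_in_parallel`) are assumptions made for this example, and `render_page` is only a placeholder for the real MediaWiki rendering.

```python
from concurrent.futures import ThreadPoolExecutor


def read_titles(lines):
    """Parse a title list: one title per line; blank lines, '#' comments
    and duplicates are dropped (assumed format, not a DumpHTML one)."""
    titles = []
    seen = set()
    for line in lines:
        title = line.strip()
        if not title or title.startswith("#"):
            continue
        if title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


def split_into_slices(titles, workers):
    """Split the titles into `workers` interleaved slices of similar size."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return [titles[i::workers] for i in range(workers)]


def render_page(title):
    """Hypothetical placeholder for rendering one page to static HTML."""
    return "<html><body><h1>{}</h1></body></html>".format(title)


def dump_slice(titles):
    """Render every page of one slice and return a title -> HTML mapping."""
    return {title: render_page(title) for title in titles}


def dump_in_parallel(titles, workers=4):
    """Render all selected titles, one slice per worker, and merge the results."""
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(dump_slice, split_into_slices(titles, workers)):
            pages.update(partial)
    return pages


if __name__ == "__main__":
    selection = read_titles(["Paris", "", "# capitals", "Berlin", "Paris"])
    pages = dump_in_parallel(selection, workers=2)
    print(sorted(pages))
```

Interleaved slices are used here only because they keep the workers evenly loaded without knowing page sizes in advance; a real implementation might instead split by page ID range or use separate processes.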

Costs:

  • ~ 3500 euros

Realisation

The realisation of the whole project would take 4 months and would be supervised by Kelson (creator and lead developer of Kiwix) and a member of WMFR. Payment for each work package would be made after validation by both supervisors.
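
As a quick cross-check, the hours and costs listed in the three work packages can be totalled with a few lines of Python. The figures are copied from the deliverable lists; the costs are the approximate ("~") values, and the names used (`WORK_PACKAGES`, `summarize`, `hourly_rate_range`) are purely illustrative.

```python
# Hours as (minimum, maximum) and approximate cost in euros per work package.
WORK_PACKAGES = {
    "1 - Revamping and fixing bugs": {"hours": (80, 120), "cost": 4000},
    "2 - phpzim and DumpHTML": {"hours": (60, 60), "cost": 2500},  # 40 + 20
    "3 - Collection integration and new features": {
        "hours": (85, 85),  # 30 + 15 + 20 + 15 + 5
        "cost": 3500,
    },
}


def summarize(packages):
    """Return (minimum total hours, maximum total hours, total cost)."""
    total_min = sum(p["hours"][0] for p in packages.values())
    total_max = sum(p["hours"][1] for p in packages.values())
    total_cost = sum(p["cost"] for p in packages.values())
    return total_min, total_max, total_cost


def hourly_rate_range(package):
    """Return the implied (lowest, highest) hourly rate in euros."""
    min_hours, max_hours = package["hours"]
    return package["cost"] / max_hours, package["cost"] / min_hours


if __name__ == "__main__":
    print(summarize(WORK_PACKAGES))  # (225, 265, 10000)
    for name, package in WORK_PACKAGES.items():
        low, high = hourly_rate_range(package)
        print("{}: {:.2f} to {:.2f} euros/hour".format(name, low, high))
```

Under these figures the whole project comes to 225 to 265 hours for roughly 10000 euros.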

If you want to know more: