Difference between revisions of "Mediawiki DumpHTML extension improvement"
Revision as of 16:34, 23 August 2011
The MediaWiki DumpHTML extension (http://www.mediawiki.org/wiki/Extension:DumpHTML) is the best solution for exporting dynamically generated HTML pages as a set of static HTML and media files. It gives better results, and has more potential, than alternatives such as a Web site mirroring tool or an external rendering solution. This is especially true when dealing with a large amount of content, and it is the solution already used to build the big ZIM files you can download from the Kiwix Web site.
Unfortunately, there are a few pain points:
- Not maintained, bugs are not fixed, new features are not implemented
- Only available to MediaWiki system administrators
- Does not generate ZIM files
Consequently, almost nobody uses it to generate ZIM files at the moment: it is too complicated. This is the case even though many people want to do so and contact the Kiwix development team for help making a ZIM of their own content. External projects are not the only ones that would benefit from such a development: we would also gain a lot in efficiency, and it would be the first mandatory step towards preparing ZIM files automatically.
Workpackage1: Revamping and fixing bugs
The worst point is that the DumpHTML extension is not properly maintained, and over time many issues have been discovered. Currently, the extension is not really usable without fixing or tweaking the MediaWiki code.
The purpose of this work package is to fix the most critical bugs, so that anyone running a MediaWiki can simply get an HTML dump of their content and then easily generate a ZIM file from it.
After the revamp, the HTML output should be flawless, similar to what dumpHTML.pl produces.
Deliverables:
- Revamping dumpHTML and fixing bugs (80-120 hours)
Costs:
- ~ 4000 euros
Workpackage2: phpzim creation and integration in DumpHTML extension
phpzim would be a new PHP module for creating, writing and reading ZIM files directly in PHP. It would be a binding of the zimlib, like pyzim in Python. Once this library is done, we will be able to create ZIM files directly from DumpHTML.
To get a ZIM file, the user will have to call dumpHTML.php and specify that they want ZIM output (rather than an HTML dump), along with some metadata such as title, creator, etc.
So the user will need system access to the machine where the MediaWiki instance runs, and will also need to install phpzim (which should be packaged).
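The invocation described above could be sketched roughly as follows. This is an illustration only: it is written in Python rather than PHP, and the option names `--format`, `--output`, `--title` and `--creator` are hypothetical, not existing dumpHTML.php flags. It only shows the idea that ZIM output requires metadata which a plain HTML dump does not.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Command-line options for a hypothetical ZIM-aware dump script."""
    parser = argparse.ArgumentParser(prog="dump-sketch")
    # Hypothetical flag: choose between a plain HTML dump and a ZIM file.
    parser.add_argument("--format", choices=["html", "zim"], default="html")
    parser.add_argument("--output", required=True,
                        help="target directory (HTML) or file (ZIM)")
    # Metadata, only required when a ZIM file is requested.
    parser.add_argument("--title")
    parser.add_argument("--creator")
    return parser


def parse_options(argv: List[str]) -> argparse.Namespace:
    """Parse argv and reject ZIM requests that lack the needed metadata."""
    parser = build_parser()
    options = parser.parse_args(argv)
    if options.format == "zim":
        missing = [name for name in ("title", "creator")
                   if not getattr(options, name)]
        if missing:
            # parser.error() prints a usage message and exits with status 2.
            parser.error("ZIM output requires: "
                         + ", ".join("--" + name for name in missing))
    return options
```

A call such as `parse_options(["--format", "zim", "--output", "wiki.zim", "--title", "My wiki", "--creator", "Me"])` is accepted, while the same call without `--title` and `--creator` is rejected.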
Deliverables:
- phpzim (40 hours)
- updated dumpHTML (20 hours)
Costs:
- ~ 2500 euros
Workpackage3: Integrating Collection and DumpHTML extensions and new features
By integrating the DumpHTML and Collection extensions, we want to give everyone the ability to easily create small ZIMs from the Wikipedia user interface, with the following advantages:
- exactly the same rendering as online
- no external dependency to install for the MediaWiki admin
- rendering done by MediaWiki (as fast as online browsing)
We also want to implement a few additional features (see the list of deliverables).
Deliverables:
- Book and DumpHTML integration (30 hours)
- DumpHTML parallel processing (15 hours)
- Build selection based on list of titles (20 hours)
- Create an offline skin for mobile devices, to make dumps for them (15 hours)
- Make an offline skin that leaves out pictures (5 hours)
Costs:
- ~ 3500 euros
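The "DumpHTML parallel processing" and "Build selection based on list of titles" deliverables could fit together along the lines of the sketch below. It is a minimal illustration in Python, not a design for the PHP extension: `split_titles` and `dump_selection` are hypothetical names, and `render` is a stand-in for whatever MediaWiki does to produce the HTML of one page.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def split_titles(titles: List[str], workers: int) -> List[List[str]]:
    """Distribute titles round-robin into at most `workers` non-empty slices."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    slices = [titles[i::workers] for i in range(workers)]
    return [chunk for chunk in slices if chunk]


def dump_selection(titles: List[str],
                   render: Callable[[str], str],
                   workers: int = 4) -> Dict[str, str]:
    """Render only the selected titles, one slice per worker thread."""

    def render_chunk(chunk: List[str]) -> Dict[str, str]:
        # `render` is an assumed placeholder for the real page rendering.
        return {title: render(title) for title in chunk}

    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(render_chunk, split_titles(titles, workers)):
            results.update(partial)
    return results
```

For example, `dump_selection(["Paris", "Berlin", "Rome"], my_render, workers=2)` would render the three pages in two slices and return a dictionary keyed by title, where `my_render` is whatever rendering function the caller supplies.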