Tools/en
The Kiwix tools are a set of scripts (mostly in Perl) aimed at helping create content usable by Kiwix. The current development code can be found at:
svn co http://kiwix.svn.sourceforge.net/svnroot/kiwix/tools/ kiwix-tools
Kiwix is primarily designed as a tool to publish copies of Wikipedia, but every effort is made to ensure that it is also useful for:
- release of other Wikimedia Foundation projects
- release of other content created on the Mediawiki platform.
As the heart of Kiwix is the Gecko HTML rendering engine, the objective of the Kiwix tools is to produce:
- first, a coherent set of static HTML files and the resources they need: stylesheets, JavaScript code, images, etc.;
- then, from these static files, a file in the ZIM format (see below).
Storage
We call such a coherent set of multimedia content a dump or a corpus. These dumps can take many forms: previous versions of Kiwix used a simple directory layout; Moulinwiki used a file compressed with bzip2 and indexed in an SQLite database.
Today, Kiwix uses the ZIM format: a single file contains the entire dump, allowing fast access, high compression and configurability.
ZIM is an open, standard format created and maintained by the openZIM project, of which Kiwix is a founding member. ZIM is itself based on an older format, Zeno, which was created by the Berlin publishing house Directmedia and used for the German Wikipedia CD-ROM releases. The Zeno format was later abandoned, but we wanted to continue its development. The future will tell whether this initiative succeeds, but the goal is to establish a standard and thus simplify dump storage for everyone. In any case, it is already the best free solution.
Generating ZIM Files From Wikis
The question of how to generate a dump is not a simple one. For several reasons, Kiwix has so far concentrated on generating dumps that offer a selection from a given wiki site, even though publishing complete Wikipedia dumps remains a clear objective. The Kiwix tools are designed to assist in selecting entries, replicating the content from the online site to a local mirror, and then turning the mirror into a ZIM file (a command sketch of this chain is given at the end of this section).
But this is not the only way to generate a dump: theoretically, it can be done in several ways. Here is a small, non-exhaustive list of approaches:
- If you want to produce a complete dump, you can:
- obtain a ready-made HTML dump provided by the wiki administrators, as provided here by the Wikimedia Foundation, for example.
- set up a local mirror of the wiki, loading the data (the content from another wiki) into the database and then generating an HTML dump yourself. Such data can be found here for the Wikimedia Foundation projects. In the case of a selection rather than a complete dump, you can also retrieve the data dynamically from the live site (since the wiki software is open source).
- generate an HTML dump directly (by retrieving the rendered HTML pages) using website-copier ("vacuum") software; be careful not to abuse the remote web site with inordinate amounts of traffic, though!
- If you want a partial dump, you must first make a selection of entries; once you have only the entries you want, the same process applies as for a complete dump.
There are certain constraints that should be taken into account. Here are the most important ones:
- the hardware resources (equipment, computing power) of the server
- your own hardware resources
- the storage space you have for the final result
- how to make the selection if necessary.
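In practice, the selection-based approach maps onto the scripts described in the Usage section below. The sketch that follows shows the general chain of commands; the script names are real, but the option names and values are only assumptions made for illustration, so check each script's source (or run it without arguments) for its exact syntax.
# 1. Build a selection of page titles from the online wiki (hypothetical options)
./listCategoryEntries.pl --host=en.wikipedia.org --category=Physics > selection.txt
# 2. Replicate the selected pages and their dependencies into a local Mediawiki mirror
./mirrorMediawikiPages.pl --sourceHost=en.wikipedia.org --targetHost=localhost --file=selection.txt
# 3. Render the local mirror as static HTML
./dumpHtml.pl --host=localhost --directory=/tmp/html_dump
# 4. Package the static HTML directory as a ZIM file
./buildZimFileFromDirectory.pl --directory=/tmp/html_dump --zimFile=/tmp/selection.zim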
Prerequisites
You will need a number of Perl modules to run these scripts. Here is the list of modules one tester (User:Ijon) had to install on top of a plain Perl 5.10 installation on Ubuntu Linux; your mileage may vary. Install them using CPAN (perl -MCPAN -e shell), CPANPLUS (cpanp(1)), or your distribution's Perl packaging mechanism.
- Array::PrintCols
- Getargs::Long
- HTML::Parser
- HTML::Tagset
- LWP
- Log::Agent
- Log::Log4perl
- Term::Query
- URI
- XML::DOM
- XML::NamespaceSupport
- XML::Parser
- XML::Parser::PerlSAX
- XML::RegExp
- XML::SAX
- XML::SAX::Expat
- XML::Simple
The same tester managed to install all of these by installing just the following subset and letting CPAN pull in the dependencies automatically:
- XML::Simple
- XML::DOM
- Term::Query
- Array::PrintCols
- Log::Log4perl
- Getargs::Long
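For example, one possible CPAN shell session to install the subset above (answer yes when asked to follow prerequisites, or set the policy as shown) is:
perl -MCPAN -e shell
cpan> o conf prerequisites_policy follow
cpan> install XML::Simple XML::DOM Term::Query Array::PrintCols Log::Log4perl Getargs::Long
cpan> quit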
Debian/Ubuntu dependencies
sudo apt-get install liblog-log4perl-perl libdata-dumper-simple-perl libxml-simple-perl libxml-libxml-perl libarray-printcols-perl libgetargs-long-perl liburi-perl libhtml-linkextractor-perl libhtml-parser-perl libdbd-pg-perl
Usage
Here is a list of available scripts (many of them are specific to Mediawiki):
Mediawiki Maintenance
- backupMediawikiInstall.pl creates a tgz archive of a complete existing Mediawiki installation (code + resources + database).
- installMediawiki.pl brings up an instance of Mediawiki from source code without human intervention; it does this by simulating the manual Mediawiki installation process.
- resetMediawikiDatabase.pl empties a local instance of Mediawiki of all pages.
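A typical maintenance sequence might look like the sketch below; the option names and paths are purely illustrative assumptions (the real options are defined inside each script, presumably via Getargs::Long, which appears in the prerequisite list above), so read the script source for the actual interface.
# Back up an existing installation before touching it (hypothetical options and paths)
./backupMediawikiInstall.pl --mediawikiPath=/var/www/mediawiki --archive=/backup/mediawiki.tgz
# Bring up a fresh local Mediawiki instance without manual steps
./installMediawiki.pl --path=/var/www/mirror --database=mirrordb
# Later, empty the local instance before a new mirroring run
./resetMediawikiDatabase.pl --database=mirrordb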
Mirroring Tools
- buildHistoryFile.pl given a list of articles and an online Mediawiki site, obtains the complete history of each page on the list.
- extractContributorsFromHistoryFile.pl extracts a list of authors from the histories obtained by the buildHistoryFile.pl script.
- buildContributorsHtmlPages.pl given a template and a list of authors, builds a custom set of HTML pages containing all of the authors on the list.
- checkMediawikiPageCompleteness.pl checks whether the local copies of pages from an online Mediawiki site are complete, i.e. have no missing dependencies (template files, multimedia resources, etc.).
- checkPageExistence.pl given a list of page titles and an online Mediawiki site, checks whether such pages exist in it. This can be handy, for example, to see what pages have been replicated.
- checkRedirects.pl checks that no pages redirect to non-existent pages (i.e. broken redirects). Eventually, it should also detect pages that redirect to each other.
- listAllImages.pl lists all images of an online Mediawiki site.
- listAllPages.pl lists all pages in an online Mediawiki site.
- listCategoryEntries.pl lists the pages belonging to a category, recursively.
- listRedirects.pl lists page redirects in an online Mediawiki site.
- mirrorMediawikiCode.pl downloads the exact software version used by an online Mediawiki site; this includes both the Mediawiki code and the Mediawiki extensions.
- mirrorMediawikiInterwikis.pl installs on a local Mediawiki site interwiki links (cross-language links) exactly identical to those of an online Mediawiki site.
- mirrorMediawikiPages.pl copies a set of pages and their dependencies (template and multimedia resources) from an online Mediawiki site to a local Mediawiki site.
- modifyMediawikiEntry.pl removes, deletes, or replaces a list of pages from an online Mediawiki site.
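As an illustration, the history and contributor scripts chain together roughly as follows; the option syntax shown here is again an assumption made for readability, not the scripts' documented interface.
# Fetch the full history of each selected page (hypothetical options)
./buildHistoryFile.pl --host=en.wikipedia.org --file=selection.txt --history=histories.xml
# Extract the list of authors from those histories
./extractContributorsFromHistoryFile.pl --history=histories.xml > contributors.txt
# Generate HTML credit pages from a template and the author list
./buildContributorsHtmlPages.pl --template=credits.tmpl --file=contributors.txt --directory=credits/
# Sanity-check the local mirror before dumping
./checkMediawikiPageCompleteness.pl --host=localhost --file=selection.txt
./checkRedirects.pl --host=localhost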
Dumping Tools
- checkEmptyFilesInHtmlDirectory.pl checks whether a directory and its subdirectories contain empty files.
- dumpHtml.pl given a local Mediawiki site, makes fully static copies of its pages, i.e. creates a directory with all the needed HTML.
- launchTntreader.pl easily launches the tntreader program.
- optimizeContents.pl optimizes a directory of HTML pages and resources. This script calls the following external tools: HTML Tidy for HTML files and the littleutils image optimizers for images.
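Once dumpHtml.pl has produced a static directory, the other two scripts can be used to check and shrink it; as before, the options shown are hypothetical placeholders.
# Make sure no page or resource came out empty
./checkEmptyFilesInHtmlDirectory.pl --directory=/tmp/html_dump
# Run HTML Tidy and the image optimizers over the whole directory
./optimizeContents.pl --directory=/tmp/html_dump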
ZIM Generation
- buildZimFileFromDirectory.pl creates a ZIM file from a directory tree containing static HTML and other content files.
Virtual machine
We have prepared a VM to help people make ZIM files from their HTML files. Download it there. The Unix login/password is root/kiwix and, for postgres, postgres/kiwix. To build your ZIM file, go to root/dumping_tools/scripts and use buildZimFileFromDirectory.pl.