Difference between revisions of "Tools/en"

Revision as of 20:25, 26 November 2011

The Kiwix tools are a set of scripts (mostly in Perl) that help create content usable by Kiwix.

Kiwix is primarily designed as a tool to publish copies of Wikipedia, but every effort is made to ensure it would also be useful for:

release of other Wikimedia Foundation projects
release of other content created on the Mediawiki platform.

As the heart of Kiwix is the HTML rendering engine Gecko, the objective of Kiwix tools is to produce:

first, a coherent set of static HTML files and the resources they need: stylesheets, JavaScript code, images, etc.
only then, from these static files, a file in the ZIM format (see below).

Storage

We call such a coherent set of multimedia content a dump or a corpus. These dumps can take many forms: previous versions of Kiwix used a simple directory layout; Moulinwiki used a file compressed with bzip2 and indexed in an SQLite database.

Today, Kiwix uses the ZIM format: a single file contains the entire dump, allowing fast access, high compression and configurability.

ZIM is an open, standard format created and maintained by the openZIM project, of which Kiwix is a founding member. ZIM is itself based on an older format, Zeno, which was created by the Berlin publishing house Directmedia and used for the German Wikipedia released on CD-ROM. The Zeno format was later abandoned, but we wanted to continue its development. The future will tell whether this initiative succeeds, but the goal is to establish a standard and thus solve the storage problem for every dump. In any case, it is already the best free solution.

Generating ZIM Files From Wikis

The question of how to generate a dump is not a simple one. For several reasons, Kiwix has so far concentrated on generating dumps offering a selection of a given Wiki site, even if the publication of complete Wikipedia dumps remains a clear objective. The Kiwix tools are designed to assist in the selection of entries, replication of content from the online site in a local mirror, and then from the mirror to a ZIM file.

But this is not the only method to generate a dump: in theory, it can be done in different ways. Here is a short, non-exhaustive list of approaches:

If you want to produce a complete dump, you can:
- obtain a ready HTML dump provided by the wiki admin, as provided here by the Wikimedia Foundation for example.
- set up a local mirror of the wiki, load the data (the content of the other wiki) into its database, and then generate an HTML dump yourself. One can find such data for the Wikimedia Foundation here. In the case of a selection rather than a complete dump, you can also retrieve the data dynamically from the site (since the wiki is open).
- generate an HTML dump directly (by retrieving the HTML pages) using software such as Vacuum on the website (be careful not to abuse the remote Web site by inordinate amounts of traffic, though!).
If you want a partial dump, you must make a selection of items; once you have only the items you want, then the same process applies as with a complete dump.
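The warning about traffic in the list above can be illustrated with a small throttling helper. This is a minimal sketch in Python, not code from the Kiwix tools (which are written in Perl): the function name `fetch_politely`, the `min_interval` parameter, and the caller-supplied `fetch` callable are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_politely(
    titles: Iterable[str],
    fetch: Callable[[str], str],
    min_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Fetch each title once, leaving at least `min_interval` seconds between requests.

    `fetch` is a hypothetical caller-supplied function that returns the HTML
    of one page; how it talks to the remote wiki is outside this sketch.
    """
    pages: Dict[str, str] = {}
    last_request = None
    for title in titles:
        if title in pages:
            continue  # never request the same page twice
        if last_request is not None:
            # Wait out whatever remains of the minimum interval.
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        pages[title] = fetch(title)
    return pages
```

The `sleep` and `clock` parameters exist only so the pacing can be checked without real delays; in normal use the defaults are enough.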

There are certain constraints that should be taken into account. Here are the most important ones:

material resources (equipment, power) of the server
your own material resources
the storage space you have for the final result
how to make the selection if necessary.

Prerequisites

You'll need a bunch of Perl modules to run these scripts. Here is a list of modules one tester (User:Ijon) had to install given a plain Perl 5.10 installation on Ubuntu Linux. Your mileage may vary. Install them using CPAN (perl -MCPAN -e shell), CPANPLUS (cpanp(1)), or your distro's Perl bundling mechanism.

Array::PrintCols
Getargs::Long
HTML::Parser
HTML::Tagset
LWP
Log::Agent
Log::Log4perl
Term::Query
URI
XML::DOM
XML::NamespaceSupport
XML::Parser
XML::Parser::PerlSAX
XML::RegExp
XML::SAX
XML::SAX::Expat
XML::Simple

I managed to install these by installing this subset and allowing automatic installation of dependencies:

XML::Simple
XML::DOM
Term::Query
Array::PrintCols
Log::Log4perl
Getargs::Long

Debian/Ubuntu dependencies

sudo apt-get install liblog-log4perl-perl libdata-dumper-simple-perl libxml-simple-perl \
  libxml-libxml-perl libarray-printcols-perl libgetargs-long-perl \
  liburi-perl libhtml-linkextractor-perl \
  libhtml-parser-perl libdbd-pg-perl
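Whichever installation route is used, it can be useful to confirm which of the modules listed above actually load. The following is a minimal sketch in Python, not part of the Kiwix tools: it relies on the standard `perl -MModule -e 1` idiom, and the helper name `missing_perl_modules` is an assumption made for illustration.

```python
import shutil
import subprocess
from typing import Iterable, List


def missing_perl_modules(modules: Iterable[str], perl: str = "perl") -> List[str]:
    """Return the modules that `perl -MName -e 1` fails to load."""
    names = list(modules)
    if shutil.which(perl) is None:
        # No Perl interpreter on PATH, so none of the modules can be loaded.
        return names
    missing = []
    for name in names:
        result = subprocess.run(
            [perl, f"-M{name}", "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # The subset that pulled in the remaining dependencies for one tester.
    subset = [
        "XML::Simple", "XML::DOM", "Term::Query",
        "Array::PrintCols", "Log::Log4perl", "Getargs::Long",
    ]
    for name in missing_perl_modules(subset):
        print("missing:", name)
```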

Usage

Here is a list of available scripts (many of them are specific to Mediawiki):

Mediawiki Maintenance

backupMediawikiInstall.pl creates a tgz archive of a complete existing Mediawiki installation (code + resources + database).
installMediawiki.pl brings up an instance of Mediawiki from source code without human intervention. This actually simulates the manual Mediawiki installation process.
resetMediawikiDatabase.pl empties a local instance of Mediawiki of all pages.

Mirroring Tools

buildHistoryFile.pl given a list of articles and an online Mediawiki site, obtains complete histories of each page on the list.
- extractContributorsFromHistoryFile.pl extracts a list of authors from the histories obtained by the buildHistoryFile.pl script.
buildContributorsHtmlPages.pl given a template and a list of authors, builds a custom set of HTML pages containing all of the authors on the list.
checkMediawikiPageCompleteness.pl checks whether the local copies of pages from an online Mediawiki site are complete, i.e. have no dependencies (template files, multimedia resources, etc.) missing.
checkPageExistence.pl given a list of page titles and an online Mediawiki site, checks whether such pages exist in it. This can be handy, for example, to see what pages have been replicated.
checkRedirects.pl checks for pages redirecting to non-existent pages (i.e. broken redirects). Eventually, it should also check for pages redirecting to each other.
listAllImages.pl lists all images of an online Mediawiki site.
listAllPages.pl lists all pages in an online Mediawiki site.
listCategoryEntries.pl lists the pages belonging to a category, recursively.
listRedirects.pl lists page redirects in an online Mediawiki site.
mirrorMediawikiCode.pl downloads the exact same version used by an online MediaWiki site; this includes both Mediawiki code and Mediawiki extensions.
mirrorMediawikiInterwikis.pl installs on a local Mediawiki site InterWikis (cross-language links) identical to those of an online Mediawiki site.
mirrorMediawikiPages.pl copies a set of pages and their dependencies (template and multimedia resources) from an online Mediawiki site to a local Mediawiki site.
modifyMediawikiEntry.pl removes, deletes, or replaces a list of pages from an online Mediawiki site.
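The two checks described for checkRedirects.pl (broken redirects, and pages redirecting to each other) can be sketched on in-memory data. This is a Python illustration under stated assumptions, not the script's actual logic: redirects are assumed to be available as a source-to-target mapping, existing pages as a set of titles, and both function names are hypothetical.

```python
from typing import Dict, List, Set


def find_broken_redirects(redirects: Dict[str, str], pages: Set[str]) -> List[str]:
    """Return redirect sources whose target is neither a page nor another redirect."""
    return sorted(
        source
        for source, target in redirects.items()
        if target not in pages and target not in redirects
    )


def find_redirect_loops(redirects: Dict[str, str]) -> List[str]:
    """Return redirect sources whose chain of targets revisits a title."""
    looping = []
    for source in redirects:
        seen = {source}
        current = redirects[source]
        # Follow the chain for as long as it keeps landing on redirects.
        while current in redirects:
            if current in seen:
                looping.append(source)
                break
            seen.add(current)
            current = redirects[current]
    return sorted(looping)


if __name__ == "__main__":
    redirects = {"UK": "United Kingdom", "Foo": "Missing", "A": "B", "B": "A"}
    pages = {"United Kingdom"}
    print(find_broken_redirects(redirects, pages))
    print(find_redirect_loops(redirects))
```

In the sample data, "Foo" points at a title that does not exist, while "A" and "B" redirect to each other.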

Dumping Tools

checkEmptyFilesInHtmlDirectory.pl checks whether a directory and its subdirectories contain empty files.
dumpHtml.pl given a local Mediawiki site, makes all-static copies of pages, i.e. creates a directory with all needed HTML.
launchTntreader.pl easily launches the tntreader program.
optimizeContents.pl optimizes a directory with HTML pages and resources. This script calls the following extensions: HTML Tidy for HTML files; The Little utils for images.
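The kind of check performed by checkEmptyFilesInHtmlDirectory.pl can be sketched in a few lines. This is a Python illustration, not the Perl script itself; the function name `find_empty_files` is an assumption.

```python
import os
from typing import List


def find_empty_files(root: str) -> List[str]:
    """Return the paths of all zero-byte files under `root`, including subdirectories."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # isfile() skips broken symlinks, for which getsize() would fail.
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

Running such a check on the static HTML directory before building a ZIM file can catch pages or images that failed to download.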

ZIM Generation

buildZimFileFromDirectory.pl creates a ZIM file from a directory tree containing static HTML and other content files.

Virtual machine

We have prepared a VM to help people make ZIM files from their HTML files. Download it from http://download.kiwix.org/dev/ZIMmakerVM.ova.lzma. The Unix login/password is root/kiwix; for postgres it is postgres/kiwix. To build your ZIM file, go to root/dumping_tools/scripts and use buildZimFileFromDirectory.pl.
