Difference between revisions of "Tools/en"

From Kiwix
(Began translating the tools page, for the benefit of non-Francophones)
 
(Translation of the Generation section)

Revision as of 00:19, 25 March 2010


The Kiwix tools are a set of scripts (mostly in Perl) that help create content usable by Kiwix.

Kiwix is primarily designed as a tool to publish copies of Wikipedia, but every effort is made to ensure it is also useful for:

  • the release of other Wikimedia Foundation projects (http://www.wikimedia.org/)
  • the release of other content created on the MediaWiki platform.

Because the heart of Kiwix is the Gecko HTML rendering engine, the Kiwix tools aim to produce:

  • first, a coherent set of static HTML files and the resources they need: stylesheets, JavaScript code, images, etc.
  • then, and only from these static files, a file in the ZIM format (see below).
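
One way to read "coherent" here is that every local stylesheet, script, image or page referenced from an HTML file actually exists in the set. The following is a minimal sketch of such a check, not part of the Kiwix tools themselves; the function name find_missing_resources and the choice to inspect only href and src attributes are assumptions made for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlsplit


class _LinkCollector(HTMLParser):
    """Collect the href/src attribute values found in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_missing_resources(root):
    """Return (html_file, link) pairs whose local target does not exist.

    Hypothetical helper: external links and pure "#fragment" links are
    ignored, because only files inside the dump can be checked on disk.
    """
    root = Path(root)
    missing = []
    for page in sorted(root.rglob("*.html")):
        collector = _LinkCollector()
        collector.feed(page.read_text(encoding="utf-8", errors="replace"))
        for link in collector.links:
            parts = urlsplit(link)
            if parts.scheme or parts.netloc or not parts.path:
                continue  # http:, mailto:, //host/... or "#anchor"
            path = unquote(parts.path)
            if path.startswith("/"):
                target = root / path.lstrip("/")  # site-absolute link
            else:
                target = page.parent / path  # link relative to the page
            if not target.exists():
                missing.append((str(page.relative_to(root)), link))
    return missing
```

An empty result suggests the set is self-contained as far as this simple check can tell; resources referenced from inside CSS or JavaScript are not examined.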

Storage

We call such a coherent set of multimedia content a dump or a corpus. These dumps can take many forms: previous versions of Kiwix used a simple directory layout; Moulinwiki used a file compressed with bzip2 and indexed in an SQLite database.

Today, Kiwix uses the ZIM format: a single file contains the entire dump, allowing fast access, high compression and configurability.

ZIM is an open, standard format created and maintained by the openZIM project (http://www.openzim.org), of which Kiwix is a founding member. ZIM is itself based on an older format, Zeno. Zeno was created by the Berlin publishing house Directmedia (http://www.digitale-bibliothek.de) and was used for the German Wikipedia released on CD-ROM (http://www.amazon.de/Wikipedia-2007-2008-Kompakt-DVD-ROM/dp/3866400187/ref=sr_1_1?ie=UTF8&s=software&qid=1232812631&sr=8-1). The Zeno format was later abandoned, but we wanted its development to continue. The future will tell whether this initiative succeeds, but the goal is to establish a standard and thus simplify the problem of storing dumps for everyone. In any case, it is already the best free solution.

Generating ZIM Files From Wikis

The question of how to generate a dump is not a simple one. For several reasons, Kiwix has so far concentrated on generating dumps that offer a selection of a given wiki's content, even though publishing complete Wikipedia dumps remains a clear objective. The Kiwix tools are designed to assist in three steps: selecting the entries, replicating their content from the online site to a local mirror, and converting that mirror into a ZIM file.

But this is not the only way to generate a dump: in theory, it can be done in several ways. Here is a short, non-exhaustive list of approaches:

  • If you want to produce a complete dump, you can:
    • obtain a ready-made HTML dump provided by the wiki administrator, such as the ones the Wikimedia Foundation provides at http://static.wikipedia.org/.
    • set up a local mirror of the wiki by loading the data (the content of the other wiki) into its database, then generate the HTML dump yourself. Such data is available for the Wikimedia Foundation at http://download.wikimedia.org/backup-index.html. For a selection rather than a complete dump, you can also retrieve the data dynamically from the site (since the wiki's content is open).
    • generate an HTML dump directly (by retrieving the HTML pages) using a website copier (a site "vacuum") on the website. Be careful not to burden the remote website with excessive traffic, though!
  • If you want a partial dump, you must first make a selection of items; once you have only the items you want, the same process applies as for a complete dump.
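
The direct-retrieval approach, combined with a selection of items, can be sketched as follows. This is an illustration under stated assumptions rather than how the Kiwix scripts work: the function names, the User-Agent string and the base_url + title URL layout are hypothetical, and the fixed pause between requests is simply one way to avoid burdening the remote site.

```python
import time
from pathlib import Path
from urllib.parse import quote
from urllib.request import Request, urlopen


def fetch_html(url, user_agent="example-mirror-sketch/0.1"):
    """Download one page; the User-Agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return response.read()


def mirror_pages(base_url, titles, out_dir, delay_seconds=1.0,
                 fetch=fetch_html, sleep=time.sleep):
    """Save the selected pages as HTML files, pausing between requests.

    Assumes pages live at base_url + title (spaces replaced by underscores),
    which is an assumption about the remote wiki. Pages already present in
    out_dir are skipped, so an interrupted run can be resumed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    fetched_before = False
    for title in titles:
        name = quote(title.replace(" ", "_"), safe="")
        target = out_dir / (name + ".html")
        if target.exists():
            continue  # already mirrored in an earlier run
        if fetched_before:
            sleep(delay_seconds)  # be polite to the remote server
        target.write_bytes(fetch(base_url + name))
        fetched_before = True
        written.append(target)
    return written
```

A call might look like mirror_pages("https://wiki.example.org/wiki/", ["Main Page", "Kiwix"], "mirror"), where the host name is a placeholder. The fetch and sleep parameters exist only so the function can be exercised without network access.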

There are certain constraints that should be taken into account. Here are the most important ones:

  • the hardware resources (equipment, processing power) of the server
  • your own hardware resources
  • the storage space you have for the final result
  • how to make the selection if necessary.
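
The storage constraint can be checked roughly before any conversion is attempted. The sketch below sums the size of a local mirror and compares it with the capacity of the target medium; the function names and the expected_ratio parameter (a compression ratio you would have to estimate yourself) are assumptions for illustration, not figures from the Kiwix tools or the ZIM format.

```python
import os


def dump_size_bytes(root):
    """Total size in bytes of all regular files below root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # do not count symlink targets
                total += os.path.getsize(path)
    return total


def fits_on_medium(root, capacity_bytes, expected_ratio=1.0):
    """Rough check: does the dump fit in capacity_bytes?

    expected_ratio is a caller-supplied guess at compressed/uncompressed
    size; the default of 1.0 assumes no compression at all.
    """
    return dump_size_bytes(root) * expected_ratio <= capacity_bytes
```

For example, fits_on_medium("mirror", 700 * 1024 * 1024) asks whether the uncompressed mirror directory would fit in 700 MiB.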