Talk:Mediawiki DumpHTML extension improvement

From Kiwix
Revision as of 13:38, 6 September 2011 by Kelson (talk | contribs)

I highly appreciate your efforts to make automatic zim dumps happen. Two comments regarding mobile support:

  • Equations

As you mention in your proposal, it is important for mobile usage to have ZIM versions without images (otherwise the ZIM files are too large for most potential users).

Removal of all images is fine for most cases, however it has the drawback that it removes mathematical equations as well, as these are rendered as images.

This issue should be considered in the generation of the ZIM files. One variant is to include the ALT text of equations (or of all images, if this is easier). This is much better than nothing, but rather cryptic. Another variant is to include images for equations only. If the overhead in file size is high, it may even make sense to generate both variants. (The user could then download a version with all images, with equation images only, or without images but with ALT text for at least the equations.)

I never consider TeX-rendered equations as "images"; they should and always will be there. This is easy to achieve with the DumpHTML extension, because equations and images are handled differently in the PHP code. I also do not think this should increase the final ZIM file size much. I think we do not have any issue here. Kelson 19:17, 31 August 2011 (CEST)
Good to know. Actually it was a bug in my app which led to the omission of equation images. The old German Wikipedia on openzim.org contains equations. It has a little less than 1,000,000 articles and its size is only 1.5GB. So I'd also expect that equations don't increase the ZIM file size a lot. --Cip 01:39, 3 September 2011 (CEST)

Independently, it definitely also makes sense (not only for the mobile use case) to additionally have ZIM files with only a small selection of images. I am aware that this is pretty complex to do, but it may be worth experimenting to see whether a pretty simple logic, e.g. "all equation images, first N images for the 10000 most popular articles (or first image of all articles)", could give good enough results. However, this "zim-file with a selection of images" feature should not delay the implementation of the automatic dumps; it may make sense to add it later.

Fully agree with you (but only images, no equations). An algorithm to identify "important" pictures and sort them would be great. If you have time to work on that, I'm also ready to help you test the results with test ZIM files. IMO this work should also be granted. Kelson 19:17, 31 August 2011 (CEST)
I'm sorry but I fear that I won't have time to work on this. --Cip 01:39, 3 September 2011 (CEST)
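
The simple selection rule quoted above ("all equation images, first N images for the most popular articles, or first image of all articles") could be sketched roughly as follows. This is only an illustration, not part of DumpHTML: the function name `select_images`, the input layout (per-article image lists with an "is equation" flag) and the popularity counts are all assumptions made for the example.

```python
def select_images(articles, popularity, top_articles=10000,
                  images_per_top_article=3, images_per_other_article=0):
    """Return the set of image names to keep in a reduced dump.

    articles:   dict mapping article title -> list of (image_name, is_equation)
                tuples in page order (hypothetical input format).
    popularity: dict mapping article title -> view count (hypothetical).
    """
    # Rank articles by popularity and remember the most popular ones.
    ranked = sorted(articles, key=lambda title: popularity.get(title, 0),
                    reverse=True)
    top = set(ranked[:top_articles])

    keep = set()
    for title, images in articles.items():
        limit = images_per_top_article if title in top else images_per_other_article
        kept_regular = 0
        for name, is_equation in images:
            if is_equation:
                # Equations are always kept, whatever the article.
                keep.add(name)
            elif kept_regular < limit:
                # Keep only the first `limit` regular images of the article.
                keep.add(name)
                kept_regular += 1
    return keep
```

Setting `images_per_other_article=1` gives the "first image of all articles" variant. How to obtain good popularity data is left open here.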
  • Split

For the user experience on mobile phones, it would be nice to offer ZIM files split into 2GB chunks on the Wikimedia servers (in a zip archive or as separate downloads; best to offer both).

4GB is the limit due to the FAT32 file system, but some phones have a 2GB limit.
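
Splitting a large file into chunks below such a limit is simple on the server side. The sketch below is only an illustration: the function name `split_file`, the `.partNNN` naming and the default chunk size of 2^31 - 1 bytes (one plausible reading of the "2GB limit") are assumptions, not an existing tool or convention.

```python
def split_file(path, chunk_size=2**31 - 1, buffer_size=1024 * 1024):
    """Split `path` into numbered parts of at most `chunk_size` bytes each.

    Returns the list of part file names, in order. The ".partNNN" suffix
    is a hypothetical naming scheme chosen for this example.
    """
    parts = []
    with open(path, "rb") as src:
        while True:
            # Read ahead first so that no empty trailing part is created.
            data = src.read(min(buffer_size, chunk_size))
            if not data:
                break
            part_path = f"{path}.part{len(parts):03d}"
            with open(part_path, "wb") as dst:
                written = 0
                while data:
                    dst.write(data)
                    written += len(data)
                    if written >= chunk_size:
                        break
                    # Never read more than what still fits in this part.
                    data = src.read(min(buffer_size, chunk_size - written))
            parts.append(part_path)
    return parts
```

Concatenating the parts in order gives back the original file, so a reader or a user with a file manager can handle them without special tooling.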

It's true the mobile apps could include download functionality which splits the file during download, but this has some drawbacks:

1. The user may not be able to, or may not want to, download such large files on the mobile phone.

My opinion: (1) split versions of each content should always be provided; (2) we should find a solution to detect automatically if the file system where files are saved is not able to deal with big files; (3) Version for mobile should be provided beside the other ones; (4) the reader/Kiwix should be able to know whether it needs a version for mobile or not; (5) an adequate version (split or not, mobile version or not) should be downloaded by default. Kelson 19:27, 31 August 2011 (CEST)
I don't understand what you mean by "(2) we should find a solution to detect automatically if the file system where files are saved is not able to deal with big files": if the app is doing the download, there is no real benefit in downloading one large file instead of multiple small files. Therefore I'd propose to keep it simple and just always download split files (2GB), independent of the target file system (in particular as it could be pretty difficult or even impossible to detect the file system on a mobile device). It may even make sense to download 2GB splits on desktop Kiwix as well, as the user can then easily copy the downloaded files to a mobile phone, although I agree that in the desktop case having a single file also has its benefits.
I have to think about that (always 2GB split files)... there are pros and cons. Currently, the Kiwix content manager is not able to deal with many files to download... This is something I have to think about in any case. Kelson 15:38, 6 September 2011 (CEST)
Additionally having a single file is beneficial if the user wants to download a ZIM file manually, just using the web browser. While this is basically true for both desktop and mobile users, for mobile users the benefit may be pretty limited: they typically use smaller (no images) ZIM files, so fewer separate files need to be downloaded, and there is a pretty good chance that the user finds out only after downloading the large file that it won't work on his mobile.
Yes, it is not easy to have THE solution. What is sure is that I want to make the content manager of Kiwix so good that a maximum of people will use it to get new content. But there will always be people downloading content separately. On the other hand, I hope good smartphones won't have this limitation much longer... Kelson 15:38, 6 September 2011 (CEST)
ad "(3) Version for mobile should be provided beside the other ones": What exactly do you mean by a version for mobile? I was thinking that different versions (like: no images, with image selection (as soon as an algorithm is available ;), all images) are available, which can all be used on both desktop and mobile. Probably the majority of mobile users will use a no-image version while desktop users may prefer the all-image version, but in the end it's up to the user what version she wants to use. --Cip 01:39, 3 September 2011 (CEST)
ZIM files should be readable for everyone, but it is obvious that the way to render a wiki page for a small screen is different than for a big screen. That is why the WMF has two different wiki renderers (http://de.wikipedia.org vs http://de.m.wikipedia.org/). So, in addition to with/few/without images ZIM files, we should also make ZIM files for small screens, based on the mobile rendering solution of MediaWiki (new; you were also at the presentation at Wikimania). Kelson 15:38, 6 September 2011 (CEST)

2. It is pretty complex to support such a feature in an app (in particular on multiple platforms). This may mean that apps just don't support this feature and leave it up to the user to download the files. If there are no split files available, which are pretty simple to install (connect the mobile via USB, extract the archive (or copy the separate files) to the mobile), many potential users won't be able to use the ZIM file. For this scenario having separate downloads is better, because this would allow downloading on the mobile phone as well (without app support), while the zip file could normally not be downloaded (as it is larger than 2 (or 4) GB). --Cip 18:59, 31 August 2011 (CEST)

Separate download should always be possible. Detecting a mobile device is easy... in fact what matters is the screen resolution (we could introduce a meta-tag with something like minresolution inside). Detecting the nature of the file system should also be possible... although I do not know how. Any ideas? Kelson 19:27, 31 August 2011 (CEST)
Sorry, I don't get this. Are you thinking about the web page providing the zim downloads? --Cip 01:39, 3 September 2011 (CEST)
Is it still unclear what I mean? 15:38, 6 September 2011 (CEST)
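
On the open question of how to detect whether the target file system can deal with big files: one possible approach, instead of identifying the file system type, is to probe the limit directly by trying to create a file of the required size. The sketch below is only an idea under assumptions: the function name `supports_file_size` is hypothetical, it relies on the file system rejecting a write beyond its maximum file size with an error, and on file systems without sparse file support the probe may be slow or temporarily use real disk space.

```python
import os
import tempfile


def supports_file_size(directory, size):
    """Return True if a file of `size` bytes (size >= 1) can be created in `directory`.

    The probe seeks to the last byte of the would-be file and writes a
    single byte. A file system with a smaller maximum file size (FAT32,
    for example) is expected to refuse this. Other errors, such as a full
    disk, also make the function return False.
    """
    fd, probe_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as probe:
            probe.seek(size - 1)
            probe.write(b"\0")
        return True
    except (OSError, OverflowError):
        return False
    finally:
        # Always clean up the probe file.
        os.remove(probe_path)
```

A downloader could call this with the size of the unsplit file and fall back to the split version when it returns False. Whether this is practical on mobile platforms would have to be tested.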