RashiqAhmad (talk | contribs) (→Agenda) |
'''TED''' (Technology, Entertainment, Design) is a global set of conferences under the slogan "ideas worth spreading". The talks address a wide range of topics within the research and practice of science and culture, often through storytelling. Speakers are given a maximum of 18 minutes to present their ideas in the most innovative and engaging ways they can. Its web site is [http://www.ted.com www.ted.com].

The purpose of this project is to create a sustainable solution for building ZIM files that provide the TED and TEDx videos in a manner similar to ted.com.
== Goals ==
* The ZIM should provide a simple filtering/search solution to find content (by author, language, title, conference, topic, ...)
== Preparations ==
A redesign of the TED site is currently available at http://new.ted.com. <br>
We should focus on scraping that site, because the old one will eventually be discontinued.
== Site structure ==
http://new.ted.com/talks gives you a list of all the TED talks sorted by playlists and categories.
http://new.ted.com/talks/browse?sort=popular gives you a list of all the TED talks in one place, sorted by popularity. <br>
It would be best to scrape this page and add the metadata (category, playlist, etc.) ourselves later on.
== Ideas to achieve it ==
# Retrieve the list of TED(x) presentations with metadata into a local database
## A whole list of the available TED talks is available [http://www.ted.com/talks/quick-list here] (official) or [http://goo.gl/lx9Ro here] (unofficial)
## TEDx talks by language are available [http://tedxtalks.ted.com/pages/languages here].
# Download videos and re-encode them if necessary
# Retrieve the video subtitle files from www.amara.org
## Subtitles don't make much sense for TEDx
## TED has a translation program [http://www.ted.com/OpenTranslationProject here]
# Create the necessary templates of the index web pages with Jinja2 (for the search/filter feature, a JavaScript client-side solution should be tried)
## It is worth [http://diveintohtml5.info/storage.html reading this] to get an idea of how to store a database client side.
# Fill the HTML templates with the data from the XML/RDF and write the index pages in a target directory
# Run zimwriterfs to create the corresponding ZIM file of your target directory
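Step 1 above mentions a local database for the talk metadata. The following is only a minimal sketch of what that could look like, assuming SQLite (via Python's standard <code>sqlite3</code> module) as the storage engine. The table layout, the column names and the helper functions <code>open_database</code>, <code>add_talk</code> and <code>find_talks</code> are hypothetical and not taken from the project.

```python
import sqlite3

# Hypothetical schema: the column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS talks (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    speaker TEXT NOT NULL,
    language TEXT NOT NULL,
    topic TEXT
)
"""


def open_database(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_talk(conn, title, speaker, language, topic=None):
    """Insert one talk; the 'with' block commits the transaction."""
    with conn:
        conn.execute(
            "INSERT INTO talks (title, speaker, language, topic) "
            "VALUES (?, ?, ?, ?)",
            (title, speaker, language, topic),
        )


def find_talks(conn, language=None, speaker=None):
    """Return talks matching the given filters, ordered by title."""
    query = "SELECT title, speaker, language, topic FROM talks WHERE 1=1"
    params = []
    if language is not None:
        query += " AND language = ?"
        params.append(language)
    if speaker is not None:
        query += " AND speaker = ?"
        params.append(speaker)
    query += " ORDER BY title"
    return conn.execute(query, params).fetchall()
```

With placeholder data, <code>find_talks(conn, language="en")</code> would return only the rows whose language column is <code>en</code>. The same filters (author, language, topic) are the ones listed under Goals.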
== Agenda ==
* First three days (17.-19.02.2014):
** Planning how this project can be realized {{done}}
** Creation of a concept, including a conceptual ZIM file that demonstrates the very basics of this project {{done}}
* Rest of the first week (20.-23.02.2014):
** Collection of all the data {{done}}
*** Writing the scraper that scrapes TED.com {{done}}
*** Writing the scraper that scrapes the TED translation page on www.amara.org {{done}}
*** Writing the HTML templates {{done}}
*** Writing a Python script that dumps the scraped data into the HTML pages, creating static content {{done}}
* First three days of the second week (24.-26.02.2014):
** Implementing the local database that manages all the content {{done}}
** Implementing the search engine in JavaScript that allows the user to search through all of the content {{done}}
** Finally: creating the first prototype ZIM files
* Rest of the second week (27.02.-02.03.2014):
** Improving everything
** Fixing possible bugs
** Possible other things:
*** Implement a way to play HTML5 videos on the Android version of Kiwix (the bug can be found [http://sourceforge.net/p/kiwix/bugs/465/ here])
== Implementation ==
==== Networking ====
The networking library we are going to use is [http://requests.readthedocs.org/en/latest/ requests]. Requests is straightforward and easy to use.
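As a rough illustration of how requests could be used here, the sketch below builds and sends a GET request for the browse page named under Site structure. The <code>page</code> query parameter and the helper names <code>build_browse_request</code> and <code>fetch_browse_page</code> are assumptions made for this example. Whether the site paginates this way has not been verified.

```python
import requests

# URL taken from the "Site structure" notes.
BROWSE_URL = "http://new.ted.com/talks/browse"


def build_browse_request(sort="popular", page=1):
    """Prepare (but do not send) the GET request for one browse page.

    'page' is an assumed query parameter, used here only for illustration.
    """
    params = {"sort": sort, "page": page}
    return requests.Request("GET", BROWSE_URL, params=params).prepare()


def fetch_browse_page(sort="popular", page=1, timeout=30):
    """Send the prepared request and return the HTML body as text."""
    with requests.Session() as session:
        response = session.send(build_browse_request(sort, page), timeout=timeout)
    # Raise an exception for 4xx/5xx answers instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Separating the request construction from the sending makes it possible to inspect the final URL without touching the network.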
==== Scraping ====
The scraping library we are going to use is [http://www.crummy.com/software/BeautifulSoup/ BeautifulSoup4]. It lets you easily walk through all nodes of an HTML document. HTML elements can be selected either by CSS selectors or by regular expressions.
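The two selection styles can be sketched as follows. The HTML snippet, the class names <code>talk-link</code> and <code>speaker</code>, and the function names are made up for this example. The real markup of the TED site will differ and has to be inspected first.

```python
import re

from bs4 import BeautifulSoup

# Invented sample markup; it does not reflect the real TED pages.
SAMPLE_HTML = """
<div class="talk-link">
  <a href="/talks/example_speaker_an_example_talk">An example talk</a>
  <span class="speaker">Example Speaker</span>
</div>
<div class="talk-link">
  <a href="/talks/another_speaker_another_talk">Another talk</a>
  <span class="speaker">Another Speaker</span>
</div>
<a href="/playlists/1/example">A playlist, not a talk</a>
"""


def talks_by_css(html):
    """Select talk blocks with CSS selectors and collect their fields."""
    soup = BeautifulSoup(html, "html.parser")
    talks = []
    for block in soup.select("div.talk-link"):
        link = block.select_one("a")
        speaker = block.select_one("span.speaker")
        talks.append({
            "title": link.get_text(strip=True),
            "url": link["href"],
            "speaker": speaker.get_text(strip=True),
        })
    return talks


def talk_urls_by_regex(html):
    """Select every link whose href matches a regular expression."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", href=re.compile(r"^/talks/"))
    return [link["href"] for link in links]
```

The CSS variant is convenient when the page has stable class names. The regular-expression variant helps when only the URL pattern is reliable.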
==== Downloading Videos ====
Downloading videos from TED is straightforward. An example of a URL to a video can be found [http://download.ted.com/talks/YannDallAglio_2012X-light.mp4?apikey=489b859150fc58263f17110eeb44ed5fba4a3b22 here].
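A possible way to download such a file with requests is to stream it to disk in chunks, so that a large video never has to fit into memory. This is a sketch under assumptions: the helper names, the chunk size and the idea of deriving the local file name from the URL path are choices made for this example.

```python
import os
from urllib.parse import urlsplit

import requests


def filename_from_url(url):
    """Return the last path component of the URL, ignoring the query string."""
    return os.path.basename(urlsplit(url).path)


def download_video(url, target_dir=".", chunk_size=1024 * 1024, timeout=60):
    """Stream the video at 'url' into 'target_dir' and return the local path."""
    path = os.path.join(target_dir, filename_from_url(url))
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(path, "wb") as handle:
            # iter_content yields the body piece by piece instead of all at once.
            for chunk in response.iter_content(chunk_size=chunk_size):
                handle.write(chunk)
    return path
```

For a URL shaped like the example above, <code>filename_from_url</code> keeps the <code>.mp4</code> file name and drops the <code>apikey</code> query parameter.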
==== Downloading subtitles ====
The subtitles of videos are harder to get. They are all available [http://www.amara.org/en/teams/ted/videos/ here] in multiple formats. We will use the WebVTT caption format.
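To make the target format concrete, the sketch below serializes a list of cues into WebVTT text: a <code>WEBVTT</code> header line, a blank line, then one timing line and one text line per cue. It does not show how the subtitles are fetched from amara.org. The representation of a cue as a <code>(start, end, text)</code> tuple in seconds is an assumption made for this example.

```python
def format_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (hh:mm:ss.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(cues):
    """Serialize (start, end, text) tuples into the text of a WebVTT file."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)
```

A file produced this way can be referenced from an HTML5 <code>track</code> element next to the video.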
==== Building HTML sites out of the scraped content ====
We want to 'export' our scraped data to HTML so that we can run the ZIM tool on it and create compressed ZIM files from it. Of all the possibilities, [http://jinja.pocoo.org/docs/ Jinja2] seems to be the best library for that.
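A minimal sketch of rendering a static index page with Jinja2 might look like this. The template text, the field names <code>slug</code>, <code>title</code> and <code>speaker</code>, and the function <code>render_index</code> are hypothetical. In the real project the templates would more likely live in files and be loaded with a file system loader.

```python
from jinja2 import DictLoader, Environment

# Hypothetical in-memory template, kept inline so the example is self-contained.
TEMPLATES = {
    "index.html": (
        "<html><body><h1>{{ heading }}</h1><ul>\n"
        "{% for talk in talks %}"
        '<li><a href="{{ talk.slug }}.html">{{ talk.title }}</a>'
        " ({{ talk.speaker }})</li>\n"
        "{% endfor %}"
        "</ul></body></html>"
    )
}


def render_index(talks, heading="TED talks"):
    """Render the index page for a list of talk dictionaries."""
    # autoescape=True makes Jinja2 escape characters such as & and < in the data.
    env = Environment(loader=DictLoader(TEMPLATES), autoescape=True)
    return env.get_template("index.html").render(talks=talks, heading=heading)
```

The returned string can be written to a file in the target directory that is later handed to the ZIM tool.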
==== Javascript client side filter/search solution ====
...
==== Templating solution to create pages ====