Difference between revisions of "TED"

3,142 bytes added ,  10 years ago
 
(12 intermediate revisions by 2 users not shown)
Line 1: Line 1:
'''TED''' (Technology, Entertainment, Design) is a global set of conferences under the slogan "ideas worth spreading". They address a wide range of topics within the research and practice of science and culture, often through storytelling. The speakers are given a maximum of 18 minutes to present their ideas in the most innovative and engaging ways they can. Its web site is www.ted.com. The purpose of this project is to build a sustainable way of creating a ZIM file that provides the TED and TEDx videos in a manner similar to ted.com.
'''TED''' (Technology, Entertainment, Design) is a global set of conferences under the slogan "ideas worth spreading". They address a wide range of topics within the research and practice of science and culture, often through storytelling. The speakers are given a maximum of 18 minutes to present their ideas in the most innovative and engaging ways they can. Its web site is [http://www.ted.com www.ted.com].
 
The purpose of this project is to build a sustainable way of creating ZIM files that provide the TED and TEDx videos in a manner similar to ted.com.


== Goals ==
Line 8: Line 10:
* The ZIM should provide a simple filtering/search solution to find content (by author, language, title, conference, topic, ....)


== One way to achieve it ==
== Preparations ==
A redesign of the TED site is currently available at http://new.ted.com. <br>
We should focus on scraping that site, because the old one will eventually be discontinued.
 
== Site structure ==
 
http://new.ted.com/talks gives you a list of all the TED talks sorted by playlists and categories.
 
http://new.ted.com/talks/browse?sort=popular gives you a list of all the TED talks in one place, sorted by popularity. <br>
It would be best to scrape this page and add the metadata (category, playlist, etc.) ourselves later on.
 
== Ideas to achieve it ==
# Retrieve the list of TED(x) presentations with their metadata into a local database
## A complete list of the available TED talks can be found [http://www.ted.com/talks/quick-list here] (official) or [http://goo.gl/lx9Ro here] (unofficial)
## TEDx talks by language are available [http://tedxtalks.ted.com/pages/languages here].
# Download videos and re-encode them if necessary
# Retrieve the video subtitle files
# Retrieve the video subtitle files from www.amara.org
## Subtitles don't make much sense for TEDx
## TED has a translation program [http://www.ted.com/OpenTranslationProject here]
# Create the necessary templates for the index web pages (for the search/filter feature, a client-side JavaScript solution should be tried)
# Create the necessary templates for the index web pages with Jinja2 (for the search/filter feature, a client-side JavaScript solution should be tried)
## It is worth [http://diveintohtml5.info/storage.html reading this] to get an idea of how to store a database on the client side.
# Fill the HTML templates with the data from the XML/RDF and write the index pages to a target directory
# Run zimwriterfs to create the corresponding ZIM file from your target directory
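
The first step of the list above (keeping the talk metadata in a local database) could look roughly like the following sketch. It assumes SQLite through Python's standard <code>sqlite3</code> module; the table layout, the column names and the function names are illustrative assumptions, not something the project has settled on.

```python
import sqlite3

# Hypothetical schema: one row per talk; the column names are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS talks (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    speaker TEXT NOT NULL,
    language TEXT NOT NULL,
    video_url TEXT NOT NULL UNIQUE
)
"""


def open_database(path=":memory:"):
    """Open (or create) the metadata database and make sure the table exists."""
    connection = sqlite3.connect(path)
    connection.execute(SCHEMA)
    return connection


def add_talk(connection, title, speaker, language, video_url):
    """Insert one talk; a talk already stored under the same URL is skipped."""
    with connection:  # commits on success, rolls back on error
        connection.execute(
            "INSERT OR IGNORE INTO talks (title, speaker, language, video_url) "
            "VALUES (?, ?, ?, ?)",
            (title, speaker, language, video_url),
        )


def talks_by_language(connection, language):
    """Return (title, speaker) pairs for one language, sorted by title."""
    rows = connection.execute(
        "SELECT title, speaker FROM talks WHERE language = ? ORDER BY title",
        (language,),
    )
    return rows.fetchall()
```

The <code>UNIQUE</code> constraint on the video URL means the scraper can be re-run without creating duplicate rows.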
== Agenda ==
* First three days (17.-19.02.2014):
** Planning how this project can be realized {{done}}
** Creation of a concept, including a proof-of-concept ZIM file that demonstrates the very basics of this project {{done}}
* Rest of the first week (20. - 23.02.2014):
** Collection of all the data {{done}}
*** Writing the scraper that scrapes TED.com {{done}}
*** Writing the scraper that scrapes the TED translation page on www.amara.org {{done}}
*** Writing the HTML templates {{done}}
*** Writing a Python script that dumps the scraped data into the HTML pages, creating static content {{done}}
* First three days of the second week (24. - 26.02.2014):
** Implementing the local database that manages all the content {{done}}
** Implementing the search engine in JavaScript that allows the user to search through all of the content {{done}}
** Finally: creating the first prototype ZIM files
* Rest of the second week (27.02. - 02.03.2014):
** Improving everything
** Fixing possible bugs
** Possible other things:
*** Implement a way to play HTML5 videos on the Android version of Kiwix (the bug report can be found [http://sourceforge.net/p/kiwix/bugs/465/ here])
== Implementation ==
==== Networking ====
We will use [http://requests.readthedocs.org/en/latest/ requests] as the networking library. It is easy to use and straightforward.
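
A minimal sketch of how a page could be fetched with requests is shown below. The function name <code>fetch_html</code>, the timeout value and the optional <code>session</code> parameter are assumptions made for illustration; only <code>Session.get</code>, <code>raise_for_status</code> and <code>text</code> come from the library itself.

```python
def fetch_html(url, session=None, timeout=30):
    """Return the body of `url` as text, raising an exception on HTTP errors.

    `session` may be any object with a requests-style `get` method. When it
    is omitted, a `requests.Session` is created (this needs `requests`).
    """
    if session is None:
        import requests  # imported lazily so a custom session needs no install

        session = requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx answers into exceptions
    return response.text
```

A call such as <code>fetch_html("http://new.ted.com/talks/browse?sort=popular")</code> would then return the listing page as a string. Reusing one session for many requests keeps the underlying connection open, which may help when scraping a large number of pages.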
==== Scraping ====
We will use [http://www.crummy.com/software/BeautifulSoup/ BeautifulSoup4] as the scraping library. It makes it easy to walk through all the nodes of an HTML document. HTML elements can be selected either by CSS selectors or by regular expressions.
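
The two selection styles could be sketched as follows. The markup in <code>SAMPLE_HTML</code>, the class name <code>talk-link</code> and the <code>/talks/</code> URL pattern are made up for the example and do not describe the real TED pages.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup standing in for a talk listing; the real page differs.
SAMPLE_HTML = """
<div class="talk-list">
  <a class="talk-link" href="/talks/example_talk_one">Example talk one</a>
  <a class="talk-link" href="/talks/example_talk_two">Example talk two</a>
  <a href="/playlists/1">A playlist</a>
</div>
"""


def talk_links_by_css(html):
    """Select talk links with a CSS selector; returns (href, title) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a["href"], a.get_text(strip=True)) for a in soup.select("a.talk-link")]


def talk_links_by_regex(html):
    """Select the same links by matching the href against a regular expression."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"^/talks/")
    return [(a["href"], a.get_text(strip=True)) for a in soup.find_all("a", href=pattern)]
```

Both functions return the same two links for the sample markup. The CSS selector depends on the class name staying stable, while the regular expression depends only on the URL layout.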
==== Downloading Videos ====
Downloading videos from TED is straightforward. An example of a video URL can be found [http://download.ted.com/talks/YannDallAglio_2012X-light.mp4?apikey=489b859150fc58263f17110eeb44ed5fba4a3b22 here].
==== Downloading subtitles ====
The subtitles of the videos are harder to get. They are all available [http://www.amara.org/en/teams/ted/videos/ here] in multiple formats. We will use the WebVTT caption format.
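
To give an idea of what a WebVTT file contains, here is a small parsing sketch that uses only the Python standard library. It handles the basic cue layout (an optional identifier, a timing line, then the caption text) and ignores cue settings, notes and styles; the function names are assumptions for illustration.

```python
import re

# A WebVTT timestamp is [hh:]mm:ss.ttt; the hours part is optional.
TIMESTAMP = r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})"
TIMING_LINE = re.compile(TIMESTAMP + r"\s+-->\s+" + TIMESTAMP)


def _seconds(hours, minutes, seconds, millis):
    """Convert the captured timestamp parts into seconds."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_webvtt(text):
    """Return a list of (start_seconds, end_seconds, caption_text) cues."""
    cues = []
    # Blocks are separated by blank lines; the first block is the WEBVTT header.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for index, line in enumerate(lines):
            match = TIMING_LINE.match(line)
            if match:
                groups = match.groups()
                start, end = _seconds(*groups[:4]), _seconds(*groups[4:])
                cues.append((start, end, "\n".join(lines[index + 1:])))
                break  # lines before the timing line are the cue identifier
    return cues
```

Blocks without a timing line, such as the header, are skipped.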
==== Building HTML sites out of the scraped content  ====
We want to 'export' our scraped data to HTML, so that we can run the ZIM tool on it and create compressed ZIM files from it. Of all the possibilities, [http://jinja.pocoo.org/docs/ Jinja2] seems to be the best library for that.
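
Rendering an index page with Jinja2 might look like the sketch below. The template text and the field names <code>title</code>, <code>speaker</code> and <code>path</code> are hypothetical; in practice the template would probably live in its own file and be loaded with a <code>FileSystemLoader</code>.

```python
from jinja2 import Environment

# Hypothetical index template; the field names are assumptions.
INDEX_TEMPLATE = """<ul>
{% for talk in talks %}  <li><a href="{{ talk.path }}">{{ talk.title }}</a> ({{ talk.speaker }})</li>
{% endfor %}</ul>"""


def render_index(talks):
    """Render the list of talks (dicts or objects) into an HTML fragment."""
    # autoescape=True escapes characters such as & and < in the scraped text.
    environment = Environment(autoescape=True)
    return environment.from_string(INDEX_TEMPLATE).render(talks=talks)
```

The returned string can then be written to the target directory that is later passed to the ZIM tool.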
==== Javascript client side filter/search solution ====
...
==== Templating solution to create pages ====