Difference between revisions of "TED"

1,304 bytes added, 10 years ago
 
(10 intermediate revisions by 2 users not shown)
Line 21: Line 21:
It would be best to scrape this site and add the metadata (category, playlist, etc.) ourselves later on.


== Libraries ==
== Ideas to achieve it ==
 
=== Networking ===
The networking library we are going to use is [http://requests.readthedocs.org/en/latest/ requests].<br>
Requests is straightforward to use.
 
=== Scraping ===
The scraping library we are going to use is [http://www.crummy.com/software/BeautifulSoup/ BeautifulSoup4].<br>
It makes it easy to walk all nodes of an HTML document. HTML elements can be selected either with CSS selectors or with regular expressions.
 
=== Downloading Videos ===
Downloading videos from TED is straightforward. An example of a URL to a video can be found [http://download.ted.com/talks/YannDallAglio_2012X-light.mp4?apikey=489b859150fc58263f17110eeb44ed5fba4a3b22 here].
 
The subtitles of the videos are harder to get. They are all available [http://www.amara.org/en/teams/ted/videos/ here] in multiple formats. We will use the SRT caption format. <br>
 
=== Building HTML sites out of the scraped content  ===
We want to 'export' our scraped data to HTML, so that we can run the ZIM tool on it and create compressed ZIM files from it. <br>
Of all the options, [http://jinja.pocoo.org/docs/ Jinja2] seems to be the best library for that.
 
== One way to achieve it ==
# Retrieve the list of TED(x) presentations with metadata into a local database
## A complete list of the TED talks is available [http://www.ted.com/talks/quick-list here] (official) or [http://goo.gl/lx9Ro here] (unofficial)
Line 52: Line 33:
# Fill the HTML templates with the data from the XML/RDF and write the index pages in a target directory
# Run zimwriterfs to create the corresponding ZIM file of your target directory
== Agenda ==
* First three days (17.-19.02.2014):
** Planning on how this project can be realized {{done}}
** Creation of a concept including a conceptual ZIM file that demonstrates the very basics of this project {{done}}
* Rest of the first week (20. - 23.02.2014):
** Collection of all the data {{done}}
*** Writing the scraper that scrapes TED.com {{done}}
*** Writing the scraper that scrapes the TED translation page on www.amara.org {{done}}
*** Writing the HTML templates {{done}}
*** Writing a Python script that dumps the scraped data into the HTML pages, creating static content {{done}}
* First three days of the second week (24. - 26.02.2014):
** Implementing the local database that manages all the content {{done}}
** Implementing the search engine in JavaScript that allows the user to search through all of the content {{done}}
** Finally: Creating the first prototype ZIM files
* Rest of the second week (27.02 - 2.03.2014):
** Improving everything
** Fixing possible bugs
** Possible other things:
*** Implement a way to play HTML5 videos on the Android version of Kiwix (the bug report can be found [http://sourceforge.net/p/kiwix/bugs/465/ here])
== Implementation ==
==== Networking ====
The networking library we are going to use is [http://requests.readthedocs.org/en/latest/ requests], which is straightforward to use.
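As a minimal sketch of how pages could be fetched, the helper below wraps a GET request with a few retries. The function name <code>fetch_html</code> and its parameters are assumptions made for illustration. The <code>session</code> argument is expected to behave like a <code>requests.Session</code>, so the helper itself does not import requests.

```python
import time


def fetch_html(session, url, retries=3, timeout=30, backoff=1.0):
    """Return the body of `url` as text, retrying on network errors.

    `session` is assumed to behave like `requests.Session` (hypothetical
    helper, not part of the project code).
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=timeout)
            # Turn HTTP 4xx/5xx answers into exceptions.
            response.raise_for_status()
            return response.text
        except OSError as error:
            # requests.RequestException derives from IOError/OSError,
            # so connection errors and HTTP errors both end up here.
            last_error = error
            # Wait a little longer after each failed attempt.
            time.sleep(backoff * attempt)
    raise last_error
```

With requests installed, a call could look like <code>fetch_html(requests.Session(), "http://www.ted.com/talks/quick-list")</code>. Reusing one session for all pages keeps the underlying connection open between requests.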
==== Scraping ====
The scraping library we are going to use is [http://www.crummy.com/software/BeautifulSoup/ BeautifulSoup4]. It makes it easy to walk all nodes of an HTML document. HTML elements can be selected either with CSS selectors or with regular expressions.
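The two selection styles can be sketched as follows. The markup this assumes (talk links whose <code>href</code> starts with <code>/talks/</code>) is a guess and not the verified structure of TED.com, and the function name <code>extract_talk_links</code> is hypothetical.

```python
import re

# Assumed shape of a talk URL path; adjust to the real markup.
TALK_HREF = re.compile(r"^/talks/[\w-]+$")


def extract_talk_links(html):
    """Return (css_links, regex_links) as lists of (title, href) tuples."""
    # Third-party dependency: pip install beautifulsoup4
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")

    # CSS selector: every anchor whose href starts with /talks/.
    css_matches = soup.select('a[href^="/talks/"]')
    # Regular expression: stricter, the whole href has to match.
    regex_matches = soup.find_all("a", href=TALK_HREF)

    def as_pairs(anchors):
        return [(a.get_text(strip=True), a["href"]) for a in anchors]

    return as_pairs(css_matches), as_pairs(regex_matches)
```

The CSS selector is shorter to write, while the regular expression can reject links such as pagination URLs that share the same prefix.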
==== Downloading Videos ====
Downloading videos from TED is straightforward. An example of a URL to a video can be found [http://download.ted.com/talks/YannDallAglio_2012X-light.mp4?apikey=489b859150fc58263f17110eeb44ed5fba4a3b22 here].
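Since video files are large, one possible approach is to stream them to disk in chunks instead of loading them into memory. The sketch below assumes a session object that behaves like <code>requests.Session</code>. The name <code>download_file</code> and the default chunk size are illustrative choices.

```python
def download_file(session, url, target_path, chunk_size=64 * 1024, timeout=60):
    """Stream `url` into `target_path` and return the number of bytes written.

    `session` is assumed to behave like `requests.Session` (hypothetical
    helper, not part of the project code).
    """
    written = 0
    # stream=True defers the body download until iter_content() is used.
    response = session.get(url, stream=True, timeout=timeout)
    try:
        response.raise_for_status()
        with open(target_path, "wb") as handle:
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:  # skip empty keep-alive chunks
                    handle.write(chunk)
                    written += len(chunk)
    finally:
        # Release the connection even if writing fails.
        response.close()
    return written
```

Returning the byte count makes it easy to compare the result against a <code>Content-Length</code> header, if the server sends one.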
==== Downloading subtitles ====
The subtitles of the videos are harder to get. They are all available [http://www.amara.org/en/teams/ted/videos/ here] in multiple formats. We will use the WebVTT caption format.
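If a subtitle should only be at hand as SRT, converting it to WebVTT is a small step, because the two formats are closely related. The following is a minimal sketch under that assumption. It is not part of the planned scraper, it handles only plain cues, and the function name <code>srt_to_webvtt</code> is hypothetical.

```python
import re

# SRT writes milliseconds after a comma, WebVTT after a dot.
_SRT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text):
    """Convert plain SRT caption text to WebVTT text."""
    # Drop a byte order mark and normalise Windows line endings.
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n{2,}", text):
        lines = block.split("\n")
        # SRT cues start with a numeric counter that WebVTT does not need.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        # Skip anything that does not start with a timing line.
        if not lines or "-->" not in lines[0]:
            continue
        lines[0] = _SRT_TIMESTAMP.sub(r"\1.\2", lines[0])
        cues.append("\n".join(lines))
    # Every WebVTT file starts with the WEBVTT header line.
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"
```

Only the timing line of each cue is rewritten, so caption text that happens to contain a comma-separated number is left untouched.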
==== Building HTML sites out of the scraped content  ====
We want to 'export' our scraped data to HTML, so that we can run the ZIM tool on it and create compressed ZIM files from it. Of all the options, [http://jinja.pocoo.org/docs/ Jinja2] seems to be the best library for that.
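A static talk page could be rendered with Jinja2 roughly as sketched below. The template markup and the keys of the <code>talk</code> dictionary (<code>title</code>, <code>speaker</code>, <code>video</code>, <code>subtitles</code>) are assumptions made for illustration and do not reflect the project's actual data model.

```python
# Hypothetical page template; the real templates may look different.
TALK_TEMPLATE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>{{ talk.title }}</title></head>
<body>
<h1>{{ talk.title }}</h1>
<p>{{ talk.speaker }}</p>
<video controls src="{{ talk.video }}">
{% for sub in talk.subtitles %}<track kind="subtitles" srclang="{{ sub.lang }}" src="{{ sub.path }}">
{% endfor %}</video>
</body>
</html>
"""


def render_talk_page(talk):
    """Render one talk (a dict with the assumed keys) to an HTML string."""
    # Third-party dependency: pip install Jinja2
    from jinja2 import DictLoader, Environment

    # autoescape=True escapes scraped text such as titles containing & or <.
    env = Environment(
        loader=DictLoader({"talk.html": TALK_TEMPLATE}),
        autoescape=True,
    )
    return env.get_template("talk.html").render(talk=talk)
```

The returned string can then be written into the target directory that is later handed to the ZIM tool. Autoescaping is switched on because the rendered values come from scraped pages.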
==== JavaScript client side filter/search solution ====
...
==== Templating solution to create pages ====