Difference between revisions of "TED"
RashiqAhmad (talk | contribs) |
Revision as of 04:02, 18 February 2014
TED (Technology, Entertainment, Design) is a global set of conferences under the slogan "ideas worth spreading". They address a wide range of topics within the research and practice of science and culture, often through storytelling. Speakers are given a maximum of 18 minutes to present their ideas in the most innovative and engaging ways they can. The web site is www.ted.com. The purpose of this project is to build a sustainable way of creating a ZIM file that provides the TED and TEDx videos in a manner similar to ted.com.
Goals
- A Python script that can easily create ZIM files of the TED and TEDx videos, with the option to filter by language, conference, or topic
- A list of TED talks can be found here
- The data should be scraped from ted.com.
- Videos should be available in HTML5 and subtitles need to be supported
- The ZIM should provide a simple filtering/search solution to find content (by author, language, title, conference, topic, ...)
Preparations
The TED site is being redesigned, and the new version is currently available at http://new.ted.com.
We should focus on scraping that site, because the old one will eventually be discontinued.
Site structure
http://new.ted.com/talks lists all the TED talks, sorted by playlists and categories.
http://new.ted.com/talks/browse?sort=popular lists all the TED talks in one place, sorted by popularity.
It would be best to scrape this browse page and add the remaining metadata (category, playlist, etc.) ourselves later on.
Libraries
Networking
For networking we are going to use requests (http://requests.readthedocs.org/en/latest/).
Its API is simple and straightforward to use.
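A minimal sketch of fetching the browse page with requests could look like the following. The `sort=popular` URL comes from the site structure notes; the `page` parameter and both function names are assumptions made for illustration, and the real pagination scheme would need to be checked against the site.

```python
import requests

# URL taken from the "Site structure" notes.
BROWSE_URL = "http://new.ted.com/talks/browse"


def build_browse_url(page=1, sort="popular"):
    """Return the browse URL for one result page without sending a request."""
    # `page` is an assumed pagination parameter; only `sort=popular`
    # is documented on this wiki page.
    request = requests.Request("GET", BROWSE_URL, params={"sort": sort, "page": page})
    return request.prepare().url


def fetch_browse_page(session, page=1, timeout=30):
    """Download one browse page and return its HTML as text."""
    response = session.get(build_browse_url(page), timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text
```

A caller would typically create one `requests.Session()` and pass it to `fetch_browse_page` for every page, so that connections are reused between requests.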
Scraping
For scraping we are going to use Beautiful Soup 4 (http://www.crummy.com/software/BeautifulSoup/).
It makes it easy to walk through all the nodes of an HTML document. HTML elements can be selected either by CSS selectors or by regular expressions.
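To illustrate the two selection styles, here is a small sketch that runs on an invented HTML snippet. The markup, the `div.talk a` selector, and the `/talks/` link pattern are all assumptions; the real new.ted.com pages have to be inspected before choosing selectors.

```python
import re

from bs4 import BeautifulSoup

# Invented markup for illustration; the real pages will differ.
SAMPLE_HTML = """
<div class="talk"><a href="/talks/speaker_one_first_idea">First idea</a></div>
<div class="talk"><a href="/talks/speaker_two_second_idea">Second idea</a></div>
<a href="/playlists/42">Not a talk</a>
"""


def talks_by_css(html):
    """Select talk links with a CSS selector (the selector is an assumption)."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.select("div.talk a")]


def talks_by_regex(html):
    """Select links whose href attribute matches a regular expression."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", href=re.compile(r"^/talks/"))
    return [(a.get_text(strip=True), a["href"]) for a in links]
```

Both functions return the same list of `(title, href)` pairs for the sample; which style is more robust depends on how stable the class names and URL patterns of the site turn out to be.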
Downloading Videos
Downloading videos from TED is straightforward. An example of a video URL is http://download.ted.com/talks/YannDallAglio_2012X-light.mp4?apikey=489b859150fc58263f17110eeb44ed5fba4a3b22
The subtitles are harder to get. They are all available at http://www.amara.org/en/teams/ted/videos/ in multiple formats. We will use the SRT caption format.
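Since the goal is HTML5 video with subtitles, one possible follow-up step is converting the downloaded SRT files to WebVTT, the caption format used by the HTML5 `<track>` element. This conversion is an assumption about how the player will be built, not something decided on this page. The sketch below uses only the standard library; `srt_to_vtt` is a hypothetical helper name.

```python
import re

# SRT timestamps use a comma before the milliseconds, WebVTT uses a dot.
TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text):
    """Convert SRT subtitle text to WebVTT text."""
    # Drop a leading byte order mark, then split on any line ending.
    lines = srt_text.lstrip("\ufeff").splitlines()
    converted = [
        # Only rewrite timing lines, so cue text is never touched.
        TIMESTAMP.sub(r"\1.\2", line) if "-->" in line else line
        for line in lines
    ]
    return "WEBVTT\n\n" + "\n".join(converted).strip() + "\n"


SAMPLE_SRT = (
    "1\n00:00:01,000 --> 00:00:04,200\nHello, world.\n\n"
    "2\n00:00:05,000 --> 00:00:07,500\nIdeas worth spreading.\n"
)
```

The numeric SRT counters are kept because WebVTT accepts them as cue identifiers.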
Building HTML sites out of the scraped content
We want to export our scraped data to HTML, so that we can run the ZIM tool on it and create compressed ZIM files from it.
Of all the options, Jinja2 (http://jinja.pocoo.org/docs/) seems to be the best library for that.
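As a rough sketch of the export step, the following renders one talk to an HTML fragment with Jinja2. The template text and the field names (`title`, `speaker`, `video`) are illustrative assumptions; the real templates and data model are still to be designed.

```python
from jinja2 import Environment

# autoescape=True escapes HTML special characters in scraped text.
env = Environment(autoescape=True)

# Inline template for brevity; a real project would likely load template files.
TALK_TEMPLATE = env.from_string(
    "<h1>{{ talk.title }}</h1>\n"
    "<p>{{ talk.speaker }}</p>\n"
    '<video controls src="{{ talk.video }}"></video>'
)


def render_talk(talk):
    """Render one talk (a dict of scraped fields) to an HTML fragment."""
    return TALK_TEMPLATE.render(talk=talk)
```

Enabling autoescaping matters here because titles and speaker names come from scraped pages and may contain characters such as `&` or `<`.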
One way to achieve it
- Retrieve the list of TED(x) presentations with their metadata into a local database
  - TEDx talks by language are available at http://tedxtalks.ted.com/pages/languages
- Download videos and re-encode them if necessary
- Retrieve the video subtitle files from www.amara.org
  - Subtitles don't make as much sense for TEDx
  - TED has a translation program: http://www.ted.com/OpenTranslationProject
- Create the necessary templates of the index web pages with Jinja2 (for the search/filter feature, a client-side JavaScript solution should be tried)
  - http://diveintohtml5.info/storage.html is worth reading to get an idea of how to store a database client side
- Fill the HTML templates with the data from the XML/RDF and write the index pages in a target directory
- Run zimwriterfs to create the corresponding ZIM file of your target directory
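The first step above, keeping the talk list and its metadata in a local database, could be sketched with the standard library's sqlite3 module. SQLite itself, the table layout, and the function names are assumptions chosen for illustration; the page does not specify which database to use.

```python
import sqlite3


def open_database(path=":memory:"):
    """Open (or create) the talk database; the schema is an assumed layout."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS talks (
               url TEXT PRIMARY KEY,
               title TEXT NOT NULL,
               speaker TEXT,
               language TEXT,
               topic TEXT
           )"""
    )
    return conn


def save_talk(conn, talk):
    """Insert or update one talk; `talk` is a dict keyed by column name."""
    conn.execute(
        "INSERT OR REPLACE INTO talks (url, title, speaker, language, topic) "
        "VALUES (:url, :title, :speaker, :language, :topic)",
        talk,
    )
    conn.commit()


def talks_by_language(conn, language):
    """Return the titles of all talks in one language, sorted by title."""
    rows = conn.execute(
        "SELECT title FROM talks WHERE language = ? ORDER BY title", (language,)
    )
    return [title for (title,) in rows]
```

Using the talk URL as the primary key means that re-running the scraper updates existing rows instead of duplicating them, and the same kind of query could back the language, conference, and topic filters listed in the goals.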