Difference between revisions of "TED"

Revision as of 04:02, 18 February 2014

1,817 bytes added , 10 years ago

no edit summary

RashiqAhmad


@@ Line 7: / Line 7: @@
 * Videos should be available in HTML5 and subtitles need to be supported
 * The ZIM should provide a simple filtering/search solution to find content (by author, language, title, conference, topic, ....)
+== Preparations ==
+The TED site is being redesigned; the new version is currently available at http://new.ted.com. <br>
+We should focus on scraping the new site, because the old one will eventually be discontinued.
+== Site structure ==
+http://new.ted.com/talks lists all the TED talks, sorted by playlists and categories.
+http://new.ted.com/talks/browse?sort=popular lists all the TED talks in one place, sorted by popularity. <br>
+It would be best to scrape this second page and add the metadata (category, playlist, etc.) ourselves later on.
+== Libraries ==
+=== Networking ===
+The networking library we are going to use will be [http://requests.readthedocs.org/en/latest/ requests].<br>
+Requests is pretty easy to use and straightforward.
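A minimal sketch of fetching the listing page with requests. The helper name `fetch_html`, the timeout value, and the optional `session` parameter are illustrative assumptions, not decisions made on this page.

```python
import requests

# Listing URL taken from the "Site structure" section.
LISTING_URL = "http://new.ted.com/talks/browse?sort=popular"


def fetch_html(url, session=None, timeout=30):
    """Return the body of `url` as text, raising an exception on HTTP errors."""
    # An injected session makes the helper testable without network access.
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

Calling `fetch_html(LISTING_URL)` would return the HTML of the listing page, which can then be handed to the scraping step.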
+=== Scraping ===
+The scraping library we are going to use will be [http://www.crummy.com/software/BeautifulSoup/ Beautiful Soup 4].<br>
+It makes it easy to walk through all the nodes of an HTML document. HTML elements can be selected either by CSS selectors or by regular expressions.
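The two selection styles mentioned above could look roughly like this. The sample markup, the `talk-link` class, and the function names are hypothetical; the real structure of the listing page has to be inspected before writing the actual selectors.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, NOT copied from the TED site.
SAMPLE_HTML = """
<div class="talk-link">
  <a href="/talks/some_speaker_some_title">Some title</a>
</div>
<div class="talk-link">
  <a href="/talks/another_speaker_another_title">Another title</a>
</div>
<a href="/playlists/1">Not a talk</a>
"""


def extract_talk_links(html):
    """Return (title, href) pairs selected with a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.select("div.talk-link a")
    return [(a.get_text(strip=True), a["href"]) for a in anchors]


def extract_talk_links_by_regex(html):
    """Return (title, href) pairs whose href matches a regular expression."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", href=re.compile(r"^/talks/"))
    return [(a.get_text(strip=True), a["href"]) for a in anchors]
```

On the sample markup both functions return the same two talk links and skip the playlist link.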
+=== Downloading Videos ===
+Downloading videos from TED is straightforward. An example of a video URL can be found [http://download.ted.com/talks/YannDallAglio_2012X-light.mp4?apikey=489b859150fc58263f17110eeb44ed5fba4a3b22 here].
+The subtitles of the videos are harder to get. They are all available [http://www.amara.org/en/teams/ted/videos/ here] in multiple formats; we will use the SRT caption format. <br>
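Since video files are large, streaming them to disk is probably preferable to loading them into memory. The sketch below shows one way to do that with requests; `download_file`, the chunk size, and the timeout are assumptions made for illustration. It does not cover fetching subtitles from Amara.

```python
import requests


def download_file(url, target_path, session=None, chunk_size=64 * 1024):
    """Stream `url` to `target_path` and return the number of bytes written."""
    if session is None:
        session = requests.Session()
    # stream=True defers the body download so it can be read chunk by chunk.
    response = session.get(url, stream=True, timeout=60)
    response.raise_for_status()
    written = 0
    with open(target_path, "wb") as handle:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # ignore empty chunks
                handle.write(chunk)
                written += len(chunk)
    return written
```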
+=== Building HTML sites out of the scraped content ===
+We want to export our scraped data to HTML, so that we can run the ZIM tool on it and create compressed ZIM files from it. <br>
+Out of all the possibilities [http://jinja.pocoo.org/docs/ Jinja2] seems to be the best library for that.
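As a rough illustration of the export step, the following sketch renders a list of talks into an index page with Jinja2. The template text, the field names (`slug`, `title`, `speaker`), and `render_index` are hypothetical; a real setup would more likely load template files from a directory with `FileSystemLoader`.

```python
from jinja2 import DictLoader, Environment, select_autoescape

# Inline template for illustration only.
TEMPLATES = {
    "index.html": (
        "<ul>\n"
        "{% for talk in talks %}"
        '<li><a href="{{ talk.slug }}.html">{{ talk.title }}</a> ({{ talk.speaker }})</li>\n'
        "{% endfor %}"
        "</ul>\n"
    )
}

env = Environment(
    loader=DictLoader(TEMPLATES),
    # Escape scraped text automatically in templates ending in .html.
    autoescape=select_autoescape(["html"]),
)


def render_index(talks):
    """Render the index page for a list of talk dictionaries."""
    return env.get_template("index.html").render(talks=talks)
```

Autoescaping is enabled here because titles and speaker names come from scraped pages and may contain characters such as `&` or `<`.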
 == One way to achieve it ==
@@ Line 13: / Line 43: @@
 ## TEDx talks by language are available [http://tedxtalks.ted.com/pages/languages here].
 # Download videos and re-encode them if necessary
-# Retrieve the video subtitle files
+# Retrieve the video subtitle files from www.amara.org
 ## Subtitles don't make much sense for TEDx
 ## TED has a translation program [http://www.ted.com/OpenTranslationProject here]
-# Create the necessary templates of the index web pages (For the search/filter feature, a javascript client side solution should be tried)
+# Create the necessary templates of the index web pages with Jinja2 (for the search/filter feature, a client-side JavaScript solution should be tried)
 ## Interesting to [http://diveintohtml5.info/storage.html read this] to get an idea how to store a database client side.
 # Fill the HTML templates with the data from the XML/RDF and write the index pages in a target directory
 # Run zimwriterfs to create the corresponding ZIM file of your target directory
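For the client-side search/filter feature, one possible approach is to write the talk metadata to a JSON file that a script on the index page can load and filter. The sketch below assumes each talk is a dictionary keyed by the filter criteria listed in the requirements (author, language, title, conference, topic); `build_search_index` and `filter_talks` are hypothetical helpers, and `filter_talks` only mirrors in Python what the JavaScript side would have to do.

```python
import json

# Fields taken from the filter criteria named in the requirements.
INDEX_FIELDS = ("author", "language", "title", "conference", "topic")


def build_search_index(talks):
    """Serialize only the fields needed for filtering to a JSON string."""
    records = [{key: talk.get(key, "") for key in INDEX_FIELDS} for talk in talks]
    return json.dumps(records, ensure_ascii=False, sort_keys=True)


def filter_talks(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]
```

The resulting string can be written into the target directory next to the index pages before running zimwriterfs.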
