Line 21: | Line 21: | ||
It would be best to scrape this site and add the metadata (Category, playlist etc.) by ourselves later on. | It would be best to scrape this site and add the metadata (Category, playlist etc.) by ourselves later on. | ||
== | == Ideas to achieve it == | ||
# Retrieve the list of TED(x) presentations with metadata into a local database | # Retrieve the list of TED(x) presentations with metadata into a local database | ||
## A complete list of TED talks is available [http://www.ted.com/talks/quick-list here] (official) or [http://goo.gl/lx9Ro here] (unofficial) | ## A complete list of TED talks is available [http://www.ted.com/talks/quick-list here] (official) or [http://goo.gl/lx9Ro here] (unofficial) | ||
Line 52: | Line 33: | ||
# Fill the HTML templates with the data from the XML/RDF and write the index pages in a target directory | # Fill the HTML templates with the data from the XML/RDF and write the index pages in a target directory | ||
# Run zimwriterfs to create the corresponding ZIM file of your target directory | # Run zimwriterfs to create the corresponding ZIM file of your target directory | ||
== Agenda == | |||
... | |||
== Implementation == | |||
==== Networking ==== | |||
For networking we will use [http://requests.readthedocs.org/en/latest/ requests], which has a simple, straightforward API. | |||
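A minimal sketch of how a page could be fetched with requests. The function name <code>fetch_html</code>, the <code>USER_AGENT</code> string and the timeout value are assumptions made for this example, not part of the plan above.

```python
import requests

# Hypothetical User-Agent string, chosen only for this sketch.
USER_AGENT = "ted-scraper-sketch/0.1"


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors."""
    # Reusing one Session keeps connections alive across many requests.
    if session is None:
        session = requests.Session()
    response = session.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    # Turn 4xx/5xx responses into exceptions instead of parsing error pages.
    response.raise_for_status()
    return response.text
```

It could be called as <code>fetch_html("http://www.ted.com/talks/quick-list")</code>. Passing a shared <code>requests.Session</code> is probably worthwhile once many pages are fetched in a row.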
==== Scraping ==== | |||
For scraping we will use [http://www.crummy.com/software/BeautifulSoup/ BeautifulSoup4], which makes it easy to walk all nodes of an HTML document. HTML elements can be selected either by CSS selectors or by regular expressions. | |||
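The two selection styles can be illustrated with a short sketch. The sample markup, the <code>talk</code> class and both function names are invented for this example; the real TED pages may be structured quite differently.

```python
import re

from bs4 import BeautifulSoup

# Invented markup for illustration only.
SAMPLE_HTML = """
<table>
  <tr><td><a class="talk" href="/talks/example_talk_one">Example talk one</a></td></tr>
  <tr><td><a class="talk" href="/talks/example_talk_two">Example talk two</a></td></tr>
  <tr><td><a href="/about">About</a></td></tr>
</table>
"""


def talk_links_by_css(html):
    """Select links with a CSS selector and return (title, href) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.select("a.talk")]


def talk_links_by_regex(html):
    """Select links whose href matches a regular expression."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"^/talks/")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=pattern)]
```

On the sample markup both functions return the same two links. The CSS variant depends on class names staying stable, while the regex variant depends only on the URL scheme.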
==== Downloading Videos ==== | |||
Downloading videos from TED is straightforward. An example of a video URL can be found [http://download.ted.com/talks/YannDallAglio_2012X-light.mp4?apikey=489b859150fc58263f17110eeb44ed5fba4a3b22 here]. | |||
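A sketch of how such a video URL might be downloaded in chunks with requests, so that the whole file is never held in memory. The helper names, the chunk size and the timeout are assumptions made for this example.

```python
import os
import posixpath
from urllib.parse import urlsplit

import requests


def filename_from_url(url):
    """Return the last path component of `url`, ignoring the query string."""
    return posixpath.basename(urlsplit(url).path)


def download_file(url, target_dir, session=None, chunk_size=64 * 1024):
    """Stream `url` into `target_dir` and return the path of the written file."""
    if session is None:
        session = requests.Session()
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, filename_from_url(url))
    # stream=True defers the body download until iter_content() is consumed.
    response = session.get(url, stream=True, timeout=30)
    response.raise_for_status()
    with open(path, "wb") as handle:
        for chunk in response.iter_content(chunk_size=chunk_size):
            handle.write(chunk)
    return path
```

For the example URL above, <code>filename_from_url</code> yields <code>YannDallAglio_2012X-light.mp4</code>, since the <code>apikey</code> query parameter is not part of the path.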
==== Downloading subtitles ==== | |||
Subtitles are harder to get. They are all available [http://www.amara.org/en/teams/ted/videos/ here] in multiple formats; we will use the SRT caption format. <br> | |||
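Once an SRT file has been obtained, it could be parsed with the standard library alone. The sketch below assumes the usual SRT layout (a cue number, a timecode line, then one or more caption lines, with cues separated by blank lines). The sample captions and the name <code>parse_srt</code> are invented for illustration, and how the file is fetched is left open.

```python
import re
from datetime import timedelta

# Matches a timecode line such as "00:00:01,000 --> 00:00:04,500".
TIMECODE = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)

# Invented sample captions, not taken from a real talk.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:04,500
First caption line

2
00:00:05,000 --> 00:00:07,250
Second caption,
on two lines
"""


def parse_srt(text):
    """Parse SRT text into a list of (start, end, caption) tuples."""
    cues = []
    # Normalise line endings, then split on blank lines between cues.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        # Line 0 is the cue number, line 1 the timecodes, the rest the caption.
        if len(lines) < 3:
            continue
        match = TIMECODE.match(lines[1].strip())
        if match is None:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
        start = timedelta(hours=h1, minutes=m1, seconds=s1, milliseconds=ms1)
        end = timedelta(hours=h2, minutes=m2, seconds=s2, milliseconds=ms2)
        cues.append((start, end, "\n".join(lines[2:])))
    return cues
```

Malformed cues are skipped silently here; a stricter version might log or raise instead.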
==== Building HTML sites out of the scraped content ==== | |||
We want to export our scraped data to HTML so that we can run the ZIM tool on it and create compressed ZIM files from it. <br> | |||
Of the available options, [http://jinja.pocoo.org/docs/ Jinja2] seems to be the best library for that. |
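A rough sketch of how an index page might be rendered with Jinja2 and written to a target directory. The template, the field names (<code>slug</code>, <code>title</code>, <code>speaker</code>) and the function names are hypothetical; the real pages will need their own layout and data model.

```python
import os

from jinja2 import Environment

# Hypothetical template, kept inline so the sketch is self-contained.
INDEX_TEMPLATE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>TED talks</title></head>
<body>
<ul>
{% for talk in talks %}
  <li><a href="{{ talk.slug }}.html">{{ talk.title }}</a> ({{ talk.speaker }})</li>
{% endfor %}
</ul>
</body>
</html>
"""


def render_index(talks):
    """Render the index page for a list of talk dictionaries."""
    # autoescape=True escapes scraped text such as "&" or "<" in titles.
    env = Environment(autoescape=True)
    return env.from_string(INDEX_TEMPLATE).render(talks=talks)


def write_index(talks, target_dir):
    """Write index.html into `target_dir` and return its path."""
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, "index.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_index(talks))
    return path
```

Autoescaping is enabled because the titles and speaker names come from scraped pages and may contain characters that are special in HTML. In a larger setup the templates would more likely live in files loaded through a <code>FileSystemLoader</code>.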