Screen Scraping and Creating a Feed with YQL and Yahoo! Pipes

Dav Glass recently asked if anyone could build him an RSS feed of the YUI download page using YQL and Pipes. I was fairly confident in my Pipes and YQL skills, so I decided to take a crack at it. The first step was to scrape the page’s contents using YQL.

Scraping with YQL

The YQL Console gives you access to several Data Tables, including a whole set under the category “data” that aren’t really tables at all. They’re more like APIs that connect you to some pretty powerful fetchers and parsers. I selected the one named “html”, which lets you treat any HTML document on the web as a data source and parse it using XPath.

Here’s the query I fed it:

select * from html where url="http://yuilibrary.com/downloads/?show=yui2" and xpath='//table[thead/tr/th/h2/@id="yui2"]/tbody//tr[@class="even" or @class="odd"]'

which returns the following fragment:

<tr class="even">
    <td><a title="Version 2.7.0b" href="yui2/yui_2.7.0b.zip">Version 2.7.0b</a></td>
    <td>02/19/2009</td>
    <td><em class="md5">90778a161ce9108a23a590e5198b8116</em></td>
</tr>
<tr class="odd">
    <td><a title="Version 2.6.0" href="yui2/yui_2.6.0.zip">Version 2.6.0</a></td>
    <td>10/01/2008</td>
    <td><em class="md5">41bed4b882c9148cebff5dd1a0dd8727</em></td>
</tr>
<tr class="even">
    <td><a title="Version 2.5.2" href="yui2/yui_2.5.2.zip">Version 2.5.2</a></td>
    <td>05/28/2008</td>
    <td><em class="md5">eaadfcbcb651c50092bb679266aa3c20</em></td>
</tr>
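
For readers who want to try the same row selection offline, here is a rough Python sketch of what the XPath in the query does. It is not how YQL works internally. SAMPLE is a hypothetical, trimmed stand-in for the page markup (the second table is invented purely to show the filtering), and select_rows and summarize are names made up for this example. Python's ElementTree supports only a subset of XPath, so the even/odd class test is done in plain Python rather than in the path expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for the downloads page (an assumption):
# one table whose heading id is "yui2" and one unrelated table.
SAMPLE = """
<div>
  <table>
    <thead><tr><th><h2 id="yui2">YUI 2</h2></th></tr></thead>
    <tbody>
      <tr class="even">
        <td><a title="Version 2.7.0b" href="yui2/yui_2.7.0b.zip">Version 2.7.0b</a></td>
        <td>02/19/2009</td>
      </tr>
      <tr class="odd">
        <td><a title="Version 2.6.0" href="yui2/yui_2.6.0.zip">Version 2.6.0</a></td>
        <td>10/01/2008</td>
      </tr>
    </tbody>
  </table>
  <table>
    <thead><tr><th><h2 id="other">Other</h2></th></tr></thead>
    <tbody>
      <tr class="even"><td><a href="#">Ignored row</a></td><td>n/a</td></tr>
    </tbody>
  </table>
</div>
"""


def select_rows(markup, table_id):
    """Return the even/odd rows of the table whose heading has the given id."""
    root = ET.fromstring(markup)
    rows = []
    for table in root.iter("table"):
        # Mirrors the predicate table[thead/tr/th/h2/@id="..."]
        if table.find(f"thead/tr/th/h2[@id='{table_id}']") is None:
            continue
        for tbody in table.findall("tbody"):
            for tr in tbody.iter("tr"):
                # Mirrors tr[@class="even" or @class="odd"]
                if tr.get("class") in ("even", "odd"):
                    rows.append(tr)
    return rows


def summarize(rows):
    """Pull the link text and the date cell out of each row."""
    result = []
    for tr in rows:
        tds = tr.findall("td")
        result.append((tds[0].find("a").text, tds[1].text))
    return result


print(summarize(select_rows(SAMPLE, "yui2")))
```

Selecting by the heading id is what keeps each product's rows separate, which matters for the next step.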

Creating a Feed with Pipes

You may have noticed that I already had a problem. The link text for each of the items was not descriptive enough. If I used them as-is, people subscribing to the feed would see stuff like: “1.0.0b1” and “3.0.0 Beta 1”, which would just be confusing. Instead I wanted them to look like this: “YUI Builder – 1.0.0b1” and “YUI 3 – 3.0.0 Beta 1”. That’s why my query only targets one table at a time by its id. That way I can isolate each table and prefix each link text with the correct product’s name.

Here’s what doing that looked like:

[Screen capture: the pipe in the Pipes editor]

You’ll note that the first thing I did after fetching the rows was to prepend the product name to the link text. Here’s where things got tricky. On my first attempt, back when Dav first asked for the feed, I got stuck at this point: I wasn’t able to target the first of the three td elements contained in each row. Thanks to Nagesh Susarla of the YQL/Pipes team, however, I was able to target the td I wanted by including its index number in the field targeting string, like so: item.td.0.a.content. By including the 0 in there, I’m saying that I want the first td in the set.
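
To make the indexed path a little more concrete, here is a small Python sketch of how a dotted path with a numeric segment can be resolved against a nested row. It only illustrates the idea and is not Pipes' actual implementation. The shape of the row dictionary is an assumption about how a scraped row might look as nested data, the helpers get_path, set_path and prefix_title are hypothetical, and the leading "item." is dropped because the row itself plays the role of the item. The "YUI 2" prefix is likewise just an example value.

```python
def get_path(item, path):
    """Resolve a dotted path; numeric segments index into lists."""
    node = item
    for part in path.split("."):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def set_path(item, path, value):
    """Assign a value at a dotted path that already exists."""
    parent_path, _, last = path.rpartition(".")
    parent = get_path(item, parent_path) if parent_path else item
    key = int(last) if isinstance(parent, list) else last
    parent[key] = value


def prefix_title(row, product, path="td.0.a.content"):
    """Prepend the product name to the link text in the first td."""
    set_path(row, path, f"{product} - {get_path(row, path)}")
    return row


# Assumed shape of one scraped row: a list of td cells under the "td" key.
row = {
    "class": "even",
    "td": [
        {"a": {"content": "Version 2.7.0b", "href": "yui2/yui_2.7.0b.zip"}},
        {"content": "02/19/2009"},
        {"em": {"class": "md5", "content": "90778a161ce9108a23a590e5198b8116"}},
    ],
}

prefix_title(row, "YUI 2")
print(get_path(row, "td.0.a.content"))
```

Without the 0, the path would stop at the list of cells and there would be no single a element to reach, which is roughly the wall I hit on my first attempt.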

After prefixing the link text with the product name, I use a union operator to combine the results of all the queries, since I no longer need them kept apart. Then it’s on to putting the yuilibrary.com domain name in front of the href attribute values, since the links on the page are relative and don’t include it. Then I clean everything up by looping over my rows and creating clean items out of them using an Item Builder. (I could have just renamed or copied the fields I wanted and stayed with the rows as-is, but then I’d be delivering a lot of unnecessary junk in my feed.) Finally, I run the whole thing through a sort operator on the pubDate field (not in the screen capture) and output the result as a feed.
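
The union, link fix-up, item building and sort can be sketched in a few lines of Python. This is an approximation of the pipe's later stages, not an export of it. Several things here are assumptions: the flat row dictionaries, the names build_item and build_feed, resolving links against the page URL with urljoin (the pipe simply prepends the domain, so the real links may differ), and sorting newest first, since the sort direction isn't stated above. The two row lists reuse the rows from the fragment shown earlier and only stand in for the results of separate per-table queries.

```python
from datetime import datetime
from itertools import chain
from urllib.parse import urljoin

# The page the rows were scraped from; relative links are resolved against it.
PAGE_URL = "http://yuilibrary.com/downloads/?show=yui2"

# Stand-ins for the results of two separate per-table queries (assumption).
rows_a = [
    {"title": "Version 2.7.0b", "href": "yui2/yui_2.7.0b.zip", "date": "02/19/2009"},
    {"title": "Version 2.5.2", "href": "yui2/yui_2.5.2.zip", "date": "05/28/2008"},
]
rows_b = [
    {"title": "Version 2.6.0", "href": "yui2/yui_2.6.0.zip", "date": "10/01/2008"},
]


def build_item(row):
    """Keep only the fields a feed reader needs, like the Item Builder step."""
    return {
        "title": row["title"],
        "link": urljoin(PAGE_URL, row["href"]),
        "pubDate": datetime.strptime(row["date"], "%m/%d/%Y"),
    }


def build_feed(*row_sets):
    """Union all row sets, build clean items, and sort newest first."""
    items = [build_item(row) for row in chain(*row_sets)]
    return sorted(items, key=lambda item: item["pubDate"], reverse=True)


feed = build_feed(rows_a, rows_b)
for item in feed:
    print(item["pubDate"].date(), item["title"], item["link"])
```

Parsing the date into a real datetime before sorting matters here, because the page's MM/DD/YYYY strings would not sort chronologically as plain text.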

In the end, a YQL/Pipes team member gave Dav what he needed a lot faster than I could, but hey, I learned something in the process. This technique comes in pretty handy when a page doesn’t have a feed and you want to track changes on it. So now that you know how to do it, go out there and rip the web apart!

Update: I’m proud to say that the feed I created has been copied and adopted as the feed for the YUI Downloads page.
