Posts Tagged ‘screen scraping’

Screen Scraping and Creating a Feed with YQL and Yahoo! Pipes

Thursday, June 25th, 2009

Dav Glass recently asked if anyone could build him an RSS feed of the YUI download page using YQL and Pipes. I was somewhat confident of my Pipes and YQL skills so I decided to take a crack at it. The first thing I did was to scrape the page’s contents using YQL.

Scraping with YQL

The YQL Console allows access to several Data Tables, among which is a whole set of tables under the category “data” that aren’t tables at all. They’re more like APIs allowing you to connect to some pretty powerful fetchers and parsers. I selected the one named “html” which allows you to access any html document on the web as a data source and parse it using xpath.

Here’s the query I fed it:

select * from html where url="http://yuilibrary.com/downloads/?show=yui2" and xpath='//table[thead/tr/th/h2/@id="yui2"]/tbody//tr[@class="even" or @class="odd"]'

which returns the following fragment:

<tr class="even">
    <td><a title="Version 2.7.0b" href="yui2/yui_2.7.0b.zip">Version 2.7.0b</a></td>
    <td>02/19/2009</td>
    <td><em class="md5">90778a161ce9108a23a590e5198b8116</em></td>
</tr>
<tr class="odd">
    <td><a title="Version 2.6.0" href="yui2/yui_2.6.0.zip">Version 2.6.0</a></td>
    <td>10/01/2008</td>
    <td><em class="md5">41bed4b882c9148cebff5dd1a0dd8727</em></td>
</tr>
<tr class="even">
    <td><a title="Version 2.5.2" href="yui2/yui_2.5.2.zip">Version 2.5.2</a></td>
    <td>05/28/2008</td>
    <td><em class="md5">eaadfcbcb651c50092bb679266aa3c20</em></td>
</tr>

Creating a Feed with Pipes

You may have noticed that I already had a problem. The link text for each of the items was not descriptive enough. If I used them as-is, people subscribing to the feed would see stuff like: “1.0.0b1″ and “3.0.0 Beta 1″, which would just be confusing. Instead I wanted them to look like this: “YUI Builder – 1.0.0b1″ and “YUI 3 – 3.0.0 Beta 1″. That’s why my query only targets one table at a time by its id. That way I can isolate each table and prefix each link text with the correct product’s name.

Here‘s what doing that looked like:

pipes

You’ll note that the first thing I did after fetching the rows was to prepend the link text. Here’s where things got tricky. On my first attempt to do this, back when Dav first asked for it, I got stuck here. I wasn’t able to target the first td in the three that are contained in each row. Thanks to Nagesh Susarla of the YQL/Pipes team however, I was able to target the td I wanted by including its index number in the field targeting string like so: item.td.0.a.content. By including the 0 in there, I’m telling YQL that I want the first td in the set.

After prepending the link text with the product name, I use a union operator to put the contents of all the queries together, since now I don’t need them apart anymore. Then it’s on to putting the yuilibrary.com domain name in front of the href attribute values since they weren’t there. Then I clean everything up by looping over my rows and creating clean items out of them using an Item Builder. (I could have just renamed or copied the fields I wanted and stayed with the rows as-is, but then I’d be delivering a lot of unnecessary junk in my feed.) Finally, I run the whole thing through a sort operator on the pubDate field (not in the screen capture) and output the result in a feed.

In the end, a YQL/Pipes team member gave Dav what he needed a lot faster than I could, but hey I learned something in the process. This technique comes in pretty handy when a page doesn’t have a feed and you want to track changes on it. So now that you know how to do it, go out there and rip the web apart!

Update: I’m proud to say that the feed that I created has been copied and made the feed for the YUI Downloads page.