Screen Scraping and Creating a Feed with YQL and Yahoo! Pipes
Dav Glass recently asked if anyone could build him an RSS feed of the YUI download page using YQL and Pipes. I was somewhat confident of my Pipes and YQL skills so I decided to take a crack at it. The first thing I did was to scrape the page’s contents using YQL.
Scraping with YQL
The YQL Console allows access to several Data Tables, among which is a whole set of tables under the category “data” that aren’t tables at all. They’re more like APIs allowing you to connect to some pretty powerful fetchers and parsers. I selected the one named “html”, which allows you to access any HTML document on the web as a data source and parse it using XPath.
Here’s the query I fed it:
select * from html where url="http://yuilibrary.com/downloads/?show=yui2" and xpath='//table[thead/tr/th/h2/@id="yui2"]/tbody//tr[@class="even" or @class="odd"]'
which returns the following fragment:
<tr class="even"> <td><a title="Version 2.7.0b" href="yui2/yui_2.7.0b.zip">Version 2.7.0b</a></td> <td>02/19/2009</td> <td><em class="md5">90778a161ce9108a23a590e5198b8116</em></td> </tr> <tr class="odd"> <td><a title="Version 2.6.0" href="yui2/yui_2.6.0.zip">Version 2.6.0</a></td> <td>10/01/2008</td> <td><em class="md5">41bed4b882c9148cebff5dd1a0dd8727</em></td> </tr> <tr class="even"> <td><a title="Version 2.5.2" href="yui2/yui_2.5.2.zip">Version 2.5.2</a></td> <td>05/28/2008</td> <td><em class="md5">eaadfcbcb651c50092bb679266aa3c20</em></td> </tr>
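To make the shape of that result concrete, here is a minimal Python sketch that reads the same kind of rows with the standard library. It is not part of the original pipe: the `parse_rows` helper and the wrapping `<tbody>` element are assumptions added so the fragment parses as XML, and only two of the rows above are included.

```python
import xml.etree.ElementTree as ET

# Two rows copied from the fragment above, wrapped in a root element
# (an assumption for this sketch) so they form one well-formed document.
FRAGMENT = """<tbody>
<tr class="even"> <td><a title="Version 2.7.0b" href="yui2/yui_2.7.0b.zip">Version 2.7.0b</a></td> <td>02/19/2009</td> <td><em class="md5">90778a161ce9108a23a590e5198b8116</em></td> </tr>
<tr class="odd"> <td><a title="Version 2.6.0" href="yui2/yui_2.6.0.zip">Version 2.6.0</a></td> <td>10/01/2008</td> <td><em class="md5">41bed4b882c9148cebff5dd1a0dd8727</em></td> </tr>
</tbody>"""


def parse_rows(fragment):
    """Return one dict per <tr>, named after what each cell holds."""
    rows = []
    for tr in ET.fromstring(fragment).iter("tr"):
        cells = tr.findall("td")
        link = cells[0].find("a")  # first cell: the download link
        rows.append({
            "title": link.text,
            "href": link.get("href"),
            "date": cells[1].text,                # second cell: release date
            "md5": cells[2].find("em").text,      # third cell: checksum
        })
    return rows


rows = parse_rows(FRAGMENT)
```

Each row ends up as a small dictionary with a title, a relative link, a date, and a checksum, which is everything the feed needs.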
Creating a Feed with Pipes
You may have noticed that I already had a problem. The link text for each of the items was not descriptive enough. If I used them as-is, people subscribing to the feed would see stuff like: “1.0.0b1” and “3.0.0 Beta 1”, which would just be confusing. Instead I wanted them to look like this: “YUI Builder – 1.0.0b1” and “YUI 3 – 3.0.0 Beta 1”. That’s why my query only targets one table at a time by its id. That way I can isolate each table and prefix each link text with the correct product’s name.
Here’s what doing that looked like:
You’ll note that the first thing I did after fetching the rows was to prepend the link text. Here’s where things got tricky. On my first attempt to do this, back when Dav first asked for it, I got stuck here. I wasn’t able to target the first td of the three that are contained in each row. Thanks to Nagesh Susarla of the YQL/Pipes team, however, I was able to target the td I wanted by including its index number in the field targeting string, like so: item.td.0.a.content. By including the 0 in there, I’m telling YQL that I want the first td in the set.
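The way an index works inside a path like item.td.0.a.content can be sketched in Python. This is only an illustration of the idea, not how Pipes implements it: the nested dictionary is an assumed shape for one scraped row, `get_path` and `prefix_link_text` are hypothetical helpers, and “YUI 2” is an assumed product prefix for this table. The path is written relative to the item, so the leading `item.` is dropped.

```python
def get_path(item, path):
    """Follow a dotted path such as 'td.0.a.content' through dicts and lists."""
    node = item
    for key in path.split("."):
        # When the current node is a list, the segment is a numeric index;
        # otherwise it is a dictionary key.
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node


def prefix_link_text(item, product):
    """Prepend the product name to the link text in the first td (mutates item)."""
    link = get_path(item, "td.0.a")
    link["content"] = f"{product} – {link['content']}"


# Assumed shape of one row: 'td' is a list because each row has three cells.
row = {
    "class": "even",
    "td": [
        {"a": {"content": "Version 2.7.0b", "href": "yui2/yui_2.7.0b.zip"}},
        {"content": "02/19/2009"},
        {"em": {"class": "md5", "content": "90778a161ce9108a23a590e5198b8116"}},
    ],
}

prefix_link_text(row, "YUI 2")
```

Without the `0`, the path would stop at the list of cells and have nothing named `a` to descend into, which is the wall I hit on my first attempt.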
After prepending the link text with the product name, I use a union operator to put the contents of all the queries together, since I no longer need them kept apart. Then it’s on to putting the yuilibrary.com domain name in front of the href attribute values, since they didn’t include it. Then I clean everything up by looping over my rows and creating clean items out of them using an Item Builder. (I could have just renamed or copied the fields I wanted and stayed with the rows as-is, but then I’d be delivering a lot of unnecessary junk in my feed.) Finally, I run the whole thing through a sort operator on the pubDate field (not in the screen capture) and output the result in a feed.
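For readers who think better in code than in pipe modules, the same steps can be sketched in Python. This is a rough equivalent under stated assumptions, not the pipe itself: `build_feed` is a hypothetical function, the links are resolved against the page URL from the query (the exact URLs the pipe produced are not shown here), and newest-first is an assumed sort direction. The sample rows are the three YUI 2 rows from the fragment above, deliberately out of order.

```python
from datetime import datetime
from urllib.parse import urljoin

# Page the relative hrefs were scraped from (taken from the YQL query).
PAGE_URL = "http://yuilibrary.com/downloads/?show=yui2"


def build_feed(tables):
    """tables maps a product name to its scraped rows (title, href, date)."""
    items = []
    for product, rows in tables.items():          # union of all per-table results
        for row in rows:
            items.append({                        # Item Builder: keep only clean fields
                "title": f"{product} – {row['title']}",
                "link": urljoin(PAGE_URL, row["href"]),
                "pubDate": datetime.strptime(row["date"], "%m/%d/%Y"),
            })
    # Sort on pubDate; newest first is an assumption of this sketch.
    return sorted(items, key=lambda item: item["pubDate"], reverse=True)


feed = build_feed({
    "YUI 2": [
        {"title": "Version 2.6.0", "href": "yui2/yui_2.6.0.zip", "date": "10/01/2008"},
        {"title": "Version 2.5.2", "href": "yui2/yui_2.5.2.zip", "date": "05/28/2008"},
        {"title": "Version 2.7.0b", "href": "yui2/yui_2.7.0b.zip", "date": "02/19/2009"},
    ],
})
```

The sketch folds the prefixing, the union, and the item building into one loop, whereas the pipe does them as separate modules; the result is the same list of clean items.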
In the end, a YQL/Pipes team member gave Dav what he needed a lot faster than I could, but hey, I learned something in the process. This technique comes in pretty handy when a page doesn’t have a feed and you want to track changes on it. So now that you know how to do it, go out there and rip the web apart!
Update: I’m proud to say that the feed I created has been copied and adopted as the feed for the YUI Downloads page.