Posts Tagged ‘pipes’

Screen Scraping and Creating a Feed with YQL and Yahoo! Pipes

Thursday, June 25th, 2009

Dav Glass recently asked if anyone could build him an RSS feed of the YUI download page using YQL and Pipes. I was somewhat confident of my Pipes and YQL skills so I decided to take a crack at it. The first thing I did was to scrape the page’s contents using YQL.

Scraping with YQL

The YQL Console allows access to several Data Tables, among which is a whole set of tables under the category “data” that aren’t tables at all. They’re more like APIs that connect you to some pretty powerful fetchers and parsers. I selected the one named “html”, which lets you treat any HTML document on the web as a data source and parse it using XPath.

Here’s the query I fed it:

select * from html where url="http://yuilibrary.com/downloads/?show=yui2" and xpath='//table[thead/tr/th/h2/@id="yui2"]/tbody//tr[@class="even" or @class="odd"]'

which returns the following fragment:

<tr class="even">
    <td><a title="Version 2.7.0b" href="yui2/yui_2.7.0b.zip">Version 2.7.0b</a></td>
    <td>02/19/2009</td>
    <td><em class="md5">90778a161ce9108a23a590e5198b8116</em></td>
</tr>
<tr class="odd">
    <td><a title="Version 2.6.0" href="yui2/yui_2.6.0.zip">Version 2.6.0</a></td>
    <td>10/01/2008</td>
    <td><em class="md5">41bed4b882c9148cebff5dd1a0dd8727</em></td>
</tr>
<tr class="even">
    <td><a title="Version 2.5.2" href="yui2/yui_2.5.2.zip">Version 2.5.2</a></td>
    <td>05/28/2008</td>
    <td><em class="md5">eaadfcbcb651c50092bb679266aa3c20</em></td>
</tr>
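
For readers who want to work with rows like these outside of Pipes, here is a minimal Python sketch that pulls the link text, href, date and MD5 out of the fragment above. It is not part of the original pipe. The wrapping rows element, the parse_rows helper and the dictionary keys are all assumptions made for illustration, and only two of the three rows are reproduced to keep it short.

```python
import xml.etree.ElementTree as ET

# Two rows copied from the YQL result above, wrapped in a made-up <rows>
# root element so the fragment parses as a single XML document.
FRAGMENT = """<rows>
<tr class="even">
    <td><a title="Version 2.7.0b" href="yui2/yui_2.7.0b.zip">Version 2.7.0b</a></td>
    <td>02/19/2009</td>
    <td><em class="md5">90778a161ce9108a23a590e5198b8116</em></td>
</tr>
<tr class="odd">
    <td><a title="Version 2.6.0" href="yui2/yui_2.6.0.zip">Version 2.6.0</a></td>
    <td>10/01/2008</td>
    <td><em class="md5">41bed4b882c9148cebff5dd1a0dd8727</em></td>
</tr>
</rows>"""


def parse_rows(xml_text):
    """Turn each <tr> into a flat dict (hypothetical helper and key names)."""
    rows = []
    for tr in ET.fromstring(xml_text).findall("tr"):
        cells = tr.findall("td")      # the three cells, in document order
        link = cells[0].find("a")     # the first cell holds the download link
        rows.append({
            "title": link.text,
            "href": link.get("href"),
            "date": cells[1].text,
            "md5": cells[2].find("em").text,
        })
    return rows


rows = parse_rows(FRAGMENT)
```

Each entry in rows is then a plain dictionary, which is roughly the shape the later steps in the pipe operate on.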

Creating a Feed with Pipes

You may have noticed that I already had a problem. The link text for each of the items was not descriptive enough. If I used them as-is, people subscribing to the feed would see stuff like “1.0.0b1” and “3.0.0 Beta 1”, which would just be confusing. Instead I wanted them to look like this: “YUI Builder – 1.0.0b1” and “YUI 3 – 3.0.0 Beta 1”. That’s why my query only targets one table at a time by its id. That way I can isolate each table and prefix each link text with the correct product’s name.

Here’s what doing that looked like:

[Screen capture: the pipe in the Yahoo! Pipes editor]
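
The one-query-per-table idea can be sketched in Python as follows. This only builds the query strings and the title prefix; it does not call YQL. The build_query and prefix_title helpers are hypothetical, and so are the table ids "yui3" and "builder": only "yui2" appears in the query shown earlier. The page URL is passed in unchanged because the post does not show whether it differs per table.

```python
# Query text copied from the post, with the page URL and table id
# turned into placeholders.
QUERY_TEMPLATE = (
    'select * from html where url="{url}" and '
    "xpath='//table[thead/tr/th/h2/@id=\"{table_id}\"]"
    "/tbody//tr[@class=\"even\" or @class=\"odd\"]'"
)

PAGE_URL = "http://yuilibrary.com/downloads/?show=yui2"

# Table id -> product name used as the title prefix.
# "yui3" and "builder" are assumed ids, not taken from the page.
PRODUCTS = {
    "yui2": "YUI 2",
    "yui3": "YUI 3",
    "builder": "YUI Builder",
}


def build_query(url, table_id):
    """Return a YQL query that selects the rows of a single table."""
    return QUERY_TEMPLATE.format(url=url, table_id=table_id)


def prefix_title(product, link_text):
    """Prefix a link text with its product name, e.g. 'YUI 3 – 3.0.0 Beta 1'."""
    return f"{product} – {link_text}"


queries = {table_id: build_query(PAGE_URL, table_id) for table_id in PRODUCTS}
```

Keeping one query per table is what makes the prefix unambiguous: every row a query returns belongs to exactly one product.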

You’ll note that the first thing I did after fetching the rows was to prepend the product name to the link text. Here’s where things got tricky. On my first attempt, back when Dav first asked for the feed, I got stuck here: I wasn’t able to target the first of the three td elements contained in each row. Thanks to Nagesh Susarla of the YQL/Pipes team, however, I was able to target the td I wanted by including its index number in the field targeting string, like so: item.td.0.a.content. The index is zero-based, so by including the 0 in there I’m telling Pipes that I want the first td in the set.
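
To make the indexed field path concrete, here is a small Python sketch of how a dotted path such as item.td.0.a.content could be resolved against nested data. It illustrates the idea only and is not how Pipes implements it. The resolve helper is hypothetical, and the row dictionary is a guess at the shape the scraped row takes, not a captured response.

```python
def resolve(data, path):
    """Walk a dotted path; numeric parts index into lists (zero-based)."""
    current = data
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]   # "0" selects the first element
        else:
            current = current[part]        # any other part is a dict key
    return current


# Assumed approximation of one scraped row: three td cells in a list.
row = {
    "class": "even",
    "td": [
        {"a": {"title": "Version 2.7.0b",
               "href": "yui2/yui_2.7.0b.zip",
               "content": "Version 2.7.0b"}},
        "02/19/2009",
        {"em": {"class": "md5",
                "content": "90778a161ce9108a23a590e5198b8116"}},
    ],
}

link_text = resolve({"item": row}, "item.td.0.a.content")
```

Without the 0, the path would stop at the list of three cells and there would be no a field to descend into, which matches where the first attempt got stuck.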

After prefixing the link text with the product name, I use a Union operator to put the results of all the queries together, since I no longer need them apart. Then it’s on to putting the yuilibrary.com domain name in front of the href attribute values, since the links on the page are relative. Then I clean everything up by looping over my rows and creating clean items out of them using an Item Builder. (I could have just renamed or copied the fields I wanted and stayed with the rows as-is, but then I’d be delivering a lot of unnecessary junk in my feed.) Finally, I run the whole thing through a Sort operator on the pubDate field (not in the screen capture) and output the result as a feed.
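
The same sequence of steps (prefix, union, absolute links, clean items, sort) can be sketched in plain Python. This is an approximation of what the pipe does, not a port of it. Several things are assumptions: the build_feed helper, resolving links against the page URL with urljoin rather than whatever exact prefix the pipe used, the newest-first sort order, and the "YUI 3" row, whose href and date are invented for illustration.

```python
from datetime import datetime
from urllib.parse import urljoin

# Page the rows were scraped from; relative hrefs are resolved against it.
PAGE_URL = "http://yuilibrary.com/downloads/?show=yui2"


def build_feed(tables):
    """tables maps a product name to the rows scraped from its table."""
    items = []
    for product, rows in tables.items():       # union of all tables
        for row in rows:
            items.append({                     # clean item: only needed fields
                "title": f"{product} – {row['title']}",
                "link": urljoin(PAGE_URL, row["href"]),
                "pubDate": datetime.strptime(row["date"], "%m/%d/%Y"),
            })
    # Newest first (the post does not state the sort direction).
    items.sort(key=lambda item: item["pubDate"], reverse=True)
    return items


tables = {
    "YUI 2": [
        {"title": "Version 2.5.2", "href": "yui2/yui_2.5.2.zip", "date": "05/28/2008"},
        {"title": "Version 2.7.0b", "href": "yui2/yui_2.7.0b.zip", "date": "02/19/2009"},
    ],
    # Hypothetical row: the href and date are made up for the example.
    "YUI 3": [
        {"title": "3.0.0 Beta 1", "href": "yui3/yui_3.0.0b1.zip", "date": "06/24/2009"},
    ],
}

feed = build_feed(tables)
```

Building fresh items instead of mutating the scraped rows mirrors the Item Builder step: the class attribute and MD5 cell never reach the feed.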

In the end, a YQL/Pipes team member gave Dav what he needed a lot faster than I could, but hey, I learned something in the process. This technique comes in pretty handy when a page doesn’t have a feed and you want to track changes on it. So now that you know how to do it, go out there and rip the web apart!

Update: I’m proud to say that the feed that I created has been copied and made the feed for the YUI Downloads page.

Share Smarter, Not Harder: How I Feed Twitter

Tuesday, June 23rd, 2009

I was just a kid watching DuckTales when I heard Scrooge McDuck say “work smarter, not harder” and that philosophy stuck. Maybe it’s because I’m predisposed to avoid manual labour, who knows. All I know is I’m always on the lookout for ways to simplify a process. So when it came to sharing items of interest with friends, I grew tired of manually sending links one at a time over IM. Broadcasting over Twitter was slightly better, but I ran the risk of flooding people with tweets. After all, we don’t consume media one bite-sized chunk at a time in neatly spaced intervals. We get random links from friends, watch video clips two or three at a time, browse online stores full of really cool things that we want, and so on.

What to do? The solution came to me in the form of a service called Twitterfeed. Twitterfeed lets you plug a feed into your Twitter account so that any time there’s a new item in the feed, it gets Tweeted to your account. Great, but there were still some issues to overcome, such as drawing all of my points of interest into one feed and then not flooding my Twitter account with a gzillion tweets.

Step 1: The Funnel

Using a combination of YQL and Yahoo! Pipes, I pulled in my Amazon Wish List, Google Reader shared items, YouTube favourites, blog posts on my site, and Flickr favourites into one giant feed.

[Image: twitter-funnel]

Step 2: Timed Release

Once Pipes gave me a consolidated feed, I was able to plug it into Twitterfeed. But I still had to make sure that Twitterfeed didn’t dump a bunch of tweets into Twitter at the same time. Luckily, Twitterfeed is customizable and I was able to tell it to release only one tweet every two hours. This way, whenever I marked several items in Google Reader as “shared” they got banked up and sent out one at a time over the course of the day.

[Image: twitterfeed]
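
The effect of the two-hour setting can be pictured with a short Python sketch. This is not how Twitterfeed works internally, which the post does not describe. The schedule helper and the example start time are assumptions, and the sketch only shows how a batch of shared items spreads out over the day.

```python
from datetime import datetime, timedelta


def schedule(queue, start, interval=timedelta(hours=2)):
    """Pair each queued item with a send time, one item per interval."""
    return [(start + i * interval, item) for i, item in enumerate(queue)]


# Three items shared at once, released starting at an arbitrary 9:00.
shared = ["first link", "second link", "third link"]
plan = schedule(shared, datetime(2009, 6, 23, 9, 0))
```

Here the three items go out at 9:00, 11:00 and 13:00 rather than all at once.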

Step 3: Letting People Know it’s Automated

I came across a bit of an odd phenomenon with this technique, however. People thought I never slept because I was posting tweets like clockwork at all hours. It might be fun if people think you aren’t sleeping, but it isn’t as fun when they think you aren’t working. So I added “[AutoTweet]” to the end of the tweets that Twitterfeed posted (the service allows you to prefix and suffix tweets with text). That way, people could tell right away that I didn’t manually send out the tweet.

[Image: autotweet]
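
As a rough illustration of the suffixing, here is a Python sketch that appends the marker while staying inside Twitter’s 140-character limit. Twitterfeed handles this itself, so the auto_tweet helper is hypothetical, and trimming long text with an ellipsis is an assumption about what a reasonable implementation might do.

```python
TWEET_LIMIT = 140  # Twitter's character limit at the time


def auto_tweet(text, suffix=" [AutoTweet]", limit=TWEET_LIMIT):
    """Append the suffix, trimming the text first if it would not fit."""
    room = limit - len(suffix)
    if len(text) > room:
        # Leave one character for the ellipsis that marks the cut.
        text = text[: room - 1].rstrip() + "…"
    return text + suffix


short = auto_tweet("New blog post: Share Smarter, Not Harder")
long_one = auto_tweet("x" * 200)
```

Trimming the text rather than the suffix keeps the marker visible on every automated tweet, which is the whole point of adding it.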

That’s it! I didn’t get into too much technical detail here, but if you’re interested let me know and I’ll post a much more detailed writeup.