The day after I published this post in 2007, I published a podcast feed for the NPR radio show This American Life, which used the Google Cache method to build an RSS feed for the popular radio show (TAL only provides the most recent couple shows available to the public for free via a podcast, rather than older shows).
Soon after, I took that podcast feed down after I was directly contacted by someone working for the radio show.
Fast forward five years, and Craigslist is now suing a company called 3Taps for using a method which appears to be surprisingly similar to the technique I described in 2007.
As in 2007, the legal issues surrounding this technique are unclear at best. Eric DeMenthon, the founder of PadMapper, a company using 3Taps data which was also sued by Craigslist told Ars Technica that "Since I'm not actually re-posting the content of the listings, just the facts about the listings, I figured (with legal advice) that there was no real copyright issue there."
Perhaps the courts will finally resolve this interesting legal question.
Check out the RSS/XML cache fetcher here.
Many popular websites now provide RSS/XML feeds for all kinds of data relating to their website. Digg headlines, CNN news, Craigslist for sale items, and millions of blogs.
What happens when you want to build on that data in a way that the website owner didn't plan for, and probably won't approve of?
Since they're providing an RSS feed to the general public, it's unlikely that they can stop you from downloading it. Depending on how you are using the data, they may or may not be able to use copyright law against you. However, they may be able to claim that you're hammering their servers , and thus taking up their resources. In such a situation, normally, if they tell you to stop and you continue, you could be in trouble. This comes under a claim of trespass to chattel - an arcane area of the law that has been used by companies such as Ebay and American Airlines to stop people from scraping their website.
Google has a very popular online RSS reader service, unsurprisingly named Google Reader. Instead of requesting any given website's RSS feed every time one of their millions of users wishes to view this feed in their browser, Google instead serves the user a cached copy. Google's feed crawler ("Feedfetcher") software retrieves feeds from most sites less than once every hour. Some frequently updated sites may be refreshed more often.
The benefits to website owners is clear: Instead of getting hit once every hour by hundreds of thousands of people, they get visited maybe once every hour by Google's software, who then deal with the bandwidth issues involved in getting that data to customers.
Google is even nice enough to modify the user agent string that the feed crawler sends to webservers, to tell website owners exactly how many people have subscribed to the feed.
If you are logged into a Google service (such as Gmail), you can view Google's cached copy of any RSS feed by going to: http://www.google.com/reader/atom/feed/the_feed_you_want
For example, the This American Life podcast feed can be seen by going to: http://www.google.com/reader/atom/feed/http://feeds.thisamericanlife.org/talpodcast
Unfortunately, this cannot be automated with a script - as it requires that you login to a Google Account first. While this can be automated somewhat using the Google provided ClientLogin/AuthSub mechanisms, it would still mean that you'd be scraping the query results. ClientLogin (I believe) also requires that you solve a CAPTCHA, which isn't going to work for a ruby script running on a remote server. Google's terms of service forbid users from automating and scraping data returned from Google queries. If you want to get data from Google with a script, you need to use one of their APIs.
Niall Kennedy reverse engineered the API a bit, and figured out how to get a JSON encoded version of a cached RSS feed from Google's servers. When Niall first started looking into the guts of the Reader and its APIs late last year, members of the Google Reader development team left comments on his blog commending him on his reverse engineering, and provided key bits of information. Thus, while the release of this information isn't 100% sanctioned by Google, its fair to assume that Google is aware it is out there. Most importantly, this method uses the API (and requires an API key that you can request from Google) - which means that in using it, I'm safer and more legit.
Thus, I whipped up a ruby script that will parse the JSON output from Google, and give you a real RSS feed. A live demo, and source code of the RSS/XML cache fetcher can be downloaded here.
Why is this useful? It means that you can scrape someone's RSS/XML feed without ever going to their website. While I'm no lawyer (and this is certainly not legal advice) - I'd imagine that it'd be fairly difficult for anyone to attempt a trespass to chattel claim, since they'll be unable to prove any harm, or consumption of resources.
Google has millions of customers. It's quite likely that their bot is already crawling most popular RSS feeds. Thus, it's very very unlikely that by pulling up a copy of the feed, that you'll cause Google to go and fetch a new copy.
Many websites already provide different (unpassword protected) data to Google than they do to anyone else visting their site. It may be possible, by using Google's RSS cache, to take advantage of this architecture flaw/design decision, and access data that one wouldn't normally be able to get.
Finally, since you won't be hitting anyone's webservers, there is no link (at least in their weblogs) between you and them. They have no way of knowing how often you're accessing their feed. You're hidden amongst the millions of Google users.
i've noticed that a lot of sites that require authentication to their normal content DON'T require authentication when accessing that content their RSS feeds.
this makes their situation even worse. welcome to the world of security in mashups
My name is Tyler and I'm a 2nd year M.S. HCId student in Informatics.
I'm trying to use your Google RSS cached feed script to pull up cached disciplinary actions from the Second Life police blotter feed (http://secondlife.com/community/blotter-rss.php)
However, when I execute your test script I receive a 500 and 404 error.
I know you're crazy busy, but any help you can offer in diagnosing this issue of pulling cached RSS feeds would be greatly appreciated!
thank you for the contribution. very nice and useful article...
Post a Comment