slight paranoia: June 2007

Thursday, June 28, 2007

Facebook Cares More About Privacy Than Security

Kudos to Facebook. It looks like they fixed the privacy flaw within hours of Ryan Singel's Wired News story hitting the presses. By the time I woke up this morning, Brandee Barker, Facebook's Director of Corporate Communications, had left a comment on my previous blog post to let me know that Facebook's engineers had "updated the advanced search function so that profile information that has been made private by a user, such as gender, religion, and sexual orientation, will not return a result."

Facebook's head privacy engineer, Nico Vera, seems to reside in some sort of Cheney-ish undisclosed location: He's not listed in the corporate phone directory, has instructed Facebook's receptionist to not accept outside calls, and did not reply to my intra-Facebook email.

Luckily - Facebook's PR people are a bit more responsive. It's amazing what a few calls from journalists, and a Boing Boing blog post can do to motivate a company to act quickly.

I tried a few sample searches, and can confirm that Facebook has indeed fixed the bug. My days of searching for private profiles of Facebook users under the age of 21 who list beer or marijuana as one of their interests are over. It's a shame too, as it made for a great "be careful with your information online" example when I lecture undergrads.

While Facebook offers a fantastic level of privacy controls for users, in this case, they clearly erred. Many users had gone to the effort to make their profiles private - and as such, Facebook should have assumed that they would also not wish for their profile information to be data mined through a number of iterative searches. Opt-out privacy is not the way to go - especially for users who have already communicated their intent to have their data be restricted to a small group of friends.

Facebook's engineers fixed the problem within 36 hours of the initial blog post going live, and within a business day of the blog post being linked to from Boing Boing. This rapid response is fantastic, and the Facebook team should be proud of the way they demonstrated their commitment to protecting users' private information.

Contrast this, however, to the Firefox extension vulnerability I made public one month ago. I first notified the Facebook team of the flaw in their Facebook Toolbar product over 2 months ago, on April 21, while the story hit the news a month later on May 30th.

As of this morning, it looks like Facebook has still not fixed their toolbar: it continues to seek and download updates from an unauthenticated and insecure server (http://developers.facebook.com/toolbar/updates.rdf). Google and Yahoo fixed the same problem in their products within a few days.
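As a rough illustration of the problem, here is a minimal Python sketch that flags an update URL fetched over plain HTTP. The function name is my own invention, and checking the scheme is only a first step: it says nothing about whether the update file itself is signed.

```python
from urllib.parse import urlparse

def is_secure_update_url(url):
    """Return True only if the update manifest is fetched over HTTPS.

    Hypothetical helper: it checks the URL scheme and nothing else.
    """
    return urlparse(url).scheme.lower() == "https"

# The toolbar update URL quoted above uses plain HTTP.
toolbar_url = "http://developers.facebook.com/toolbar/updates.rdf"
print(is_secure_update_url(toolbar_url))
```

Anyone who controls the network between the user and the server (an open wireless access point, for instance) can tamper with a response fetched this way.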

Yes - being able to quickly and effortlessly find out someone's sexuality, religion and drug of choice (when they believe that their profile is private) is a major problem. It's far more serious than the chance that someone in an Internet cafe will take over your laptop - which is probably why Facebook rushed to fix the privacy problem so quickly. However, the security flaw in the Facebook toolbar remains an unresolved issue, and there is simply no excuse for them to wait two months to fix this vulnerability.

Tuesday, June 26, 2007

Go Fish: Is Facebook Violating European Data Protection Rules?

Update: Facebook has fixed the problem. More here

Executive Summary

Using nothing more complex than an advanced search on Facebook's website, an interested person can learn extremely private pieces of information (sexuality, political leanings, religion) that are stored within another user's private Facebook profile.

Users of Facebook can modify the privacy settings for their profile. This will restrict the public viewing and only permit a person's immediate friends to view their profile. While Facebook does allow users to control their profile's existence in search queries, this second preference is not automatically set when a user makes their profile private - and thus many users do not know to do so.

Users cannot be expected to know that the contents of their private profiles can be mined via searches, and thus, very few do set the search permissions associated with their profile.

It is clear however that users intend for their profiles to not be public. A large number of users have gone to the effort to restrict who can view their profiles, but many, unfortunately, remain exposed to a trivial attack.

The Attack

The attack is very simple. For a specific target, one must simply issue an advanced query for the user's name, and any attribute of the profile that one wishes to search.

For example, I've created a new profile in the name of "Chris Privacy Soghoian", who is socially conservative, a Catholic and lives in London, England. His profile privacy has been set so that only his friends may see his profile. Random strangers should not be able to learn anything about the profile - they cannot click on it or view the profile's information.

By issuing an advanced search request for Name: "Chris Privacy Soghoian" and Religion: "Christian - Catholic", one can learn if the profile for that user has listed Catholicism as his religion. Note: To be able to find this profile, you need to be signed in to a facebook account that is a member of the London, UK network. Anyone can join this and other geography based networks, but you must do so first before searching.

If a profile is returned for the search terms requested, one can be sure that the user in question has the relevant information in his profile. It is also easy to see that the profile has been set to private, as the user's name is in black un-clickable text.

Likewise, a similar search for Chris Privacy Soghoian/Buddhist would come back with no results.

This shows how easy it is to learn confidential information that users believe that random strangers cannot learn when they have set their profile to be private/friends-only.

This attack is very similar to the children's game Go Fish. It won't tell you the contents of a profile, but it will provide you with positive or negative confirmation if you know what you're looking for.
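The Go Fish idea can be sketched in a few lines of Python. This is a toy in-memory simulation, not Facebook's code or any real API: the Profile class and the advanced_search and go_fish functions are names I made up for illustration. The search never reveals a field's value, but it matches against private fields, so each query acts as a yes/no oracle.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    name: str
    attributes: dict
    private: bool = True  # friends-only: strangers cannot open the profile

def advanced_search(profiles, name, **criteria):
    """Simulated flawed search: matches private fields, returns names only."""
    return [
        p.name for p in profiles
        if p.name == name
        and all(p.attributes.get(k) == v for k, v in criteria.items())
    ]

def go_fish(profiles, name, field_name, candidates):
    """Guess a private field by issuing one query per candidate value."""
    for value in candidates:
        if advanced_search(profiles, name, **{field_name: value}):
            return value  # a hit confirms the private value
    return None  # no candidate matched

profiles = [
    Profile("Chris Privacy Soghoian",
            {"religion": "Christian - Catholic", "city": "London"}),
]
guesses = ["Buddhist", "Hindu", "Christian - Catholic"]
leaked = go_fish(profiles, "Chris Privacy Soghoian", "religion", guesses)
print(leaked)
```

For fields with a short list of possible values (religion, political views, "interested in"), a handful of queries is enough to recover the private value.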

So What's The Big Deal?

I originally wrote about this attack in September of last year. Back then, I was mainly focused on finding out the names of students who admitted (in their private profiles) to working at the local strip clubs, and of those students under the age of 21 who listed beer and alcohol in their hobbies.

Stripping and alcohol are interesting enough - and they prove to be fantastic examples when I use them as a demo of "why you need to be careful on Facebook" when lecturing students in my department. However, in focusing on things that would amuse and scare undergrads, I completely missed the hot potato: Sexuality and Religion.

I attended the Privacy Enhancing Technologies workshop last week. While there, I mentioned the Facebook attack to several attendees. A couple of the Europeans were shocked, and told me that Facebook was almost certainly running afoul of a number of European data protection rules.

Privacy is not something that the US government really cares too much about - unless of course, you are a politician or Supreme Court nominee - in which case, they'll pass watertight legislation to protect your ahem "adult" movie rental records.

The Europeans do care about privacy. Sexuality and Religion are bits of information that they consider to be highly sensitive, and thus, my little go fish attack is now suddenly a lot more important than it was before. Facebook's default search privacy policies may violate European Data Protection rules.

Sample Queries

The following searches will only work if you are signed in to facebook. It is easy to create an account - anyone can make one, and all that you need is a valid email address.

The queries will search everyone within all of your networks - which will include any university/school/employer that you select, as well as a geographic group. There is no proof required of your current location, and so if you wish to search for everyone in France, it's trivial to make a new account/profile located there.

All women interested in women.
All men interested in men.
All Christian men interested in other men.
All Hindu men interested in other men.
All Muslim men interested in other men.
All Jewish men interested in other men.
All Christian women interested in other women.
All Hindu women interested in other women.
All Muslim women interested in other women.
All Jewish women interested in other women.

Clicking on any one of these - at least when you've joined a decent sized network - will return a large group of people - a fairly significant number of whom have profiles that are marked private, which you cannot click on or learn more about. However, by merely appearing in the list of returned profiles, you can be sure that the person's private profile contains information that matches the search terms. This is a problem.

Fixing The Problem

Facebook's privacy policy essentially states that Facebook is not responsible in any case where a user is able to obtain private information about someone else: Although we allow you to set privacy options that limit access to your pages, please be aware that no security measures are perfect or impenetrable... Therefore, we cannot and do not guarantee that User Content you post on the Site will not be viewed by unauthorized persons. We are not responsible for circumvention of any privacy settings or security measures contained on the Site.

Facebook should be commended for the fact that they have implemented a simple technical solution to the problem. Users can control their search privacy settings - and thus control who can see their profile when searches are issued on the facebook site. This feature did not exist when I first described the vulnerability last year.

The problem is that users must opt-in to this more restrictive privacy setting. Users who have gone to the effort of marking their profiles as private (so that others cannot view them) are not clearly warned that other users may be able to learn bits of information by issuing highly specific search queries. Users should not be expected to know or even understand this.

Facebook should change their defaults, and automatically restrict the profile search settings for any user who makes their profile private. Those users who wished to permit strangers to find them in a search could opt in and modify this setting themselves.
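The default I am proposing could look something like the following Python sketch. The PrivacySettings class and its field names are hypothetical, purely for illustration: search visibility follows the profile's privacy setting unless the user has made an explicit choice.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PrivacySettings:
    profile_private: bool = False
    # None means the user has never touched the search setting.
    searchable_by_strangers: Optional[bool] = None

def is_searchable(settings):
    """Decide whether strangers' searches may match this profile."""
    if settings.searchable_by_strangers is not None:
        return settings.searchable_by_strangers  # explicit opt-in or opt-out wins
    return not settings.profile_private  # default follows profile privacy

print(is_searchable(PrivacySettings(profile_private=True)))
```

With this rule, a user who makes their profile private drops out of strangers' search results automatically, and anyone who wants to be found can still opt in.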

Disclosure

Normally, for something like this, I would follow the norms of responsible disclosure (as I did last month with Firefox/Google) and give Facebook advance notice of my planned release. However, since I first announced this attack on my blog last September, it doesn't really make sense to try and keep it secret. This post doesn't announce anything new - it merely restates the previously described attack in clearer language, and provides a couple of screenshots and some sample queries that people can click on.

Parsing Privacy Policies: Is OpenDNS logging data forever?

OpenDNS is an alternative DNS system. It is a for-profit company which makes most of its money through Google advertisements displayed to users when they enter invalid hostnames.

OpenDNS is the frequent darling of the security press. The very same journalists frequently pummel Google (and rightly so) for their lackluster approach to customer privacy.

Last month, OpenDNS's CEO started throwing dirt at Google for their pretty shameful keyword hijacking advertisement deal with Dell and others.

In a separate matter, Google recently adjusted its logging policy (although not nearly enough), after getting smacked around in a PR dust-up initiated by Privacy International. Given the fact that David Ulevitch and OpenDNS were willing to take such an admirable public stand against Google, I decided to look into OpenDNS's own privacy and logging policies - to see how they themselves fare against the Big G.

The most relevant portions of OpenDNS's privacy policy include:

OpenDNS's DNS service collects non-personally-identifying information such as the date and time of each DNS request and the domain name requested.

OpenDNS also collects potentially personally-identifying information like Internet Protocol (IP) addresses of website visitors and IP addresses from which DNS requests are made. For its DNS services, OpenDNS is storing IP addresses temporarily to monitor and improve our quality of service.

In addition, we may combine non-personally-identifiable information with personally-identifiable information in a manner that enables us to attribute website and DNS service usage to an individual customer's computer or network.

Other than to its employees, contractors and affiliated organizations, as described above, OpenDNS discloses potentially personally-identifying and personally-identifying information only when required to do so by law, court order, or when OpenDNS believes in good faith that disclosure is reasonably necessary to protect the property or rights of OpenDNS, third parties or the public at large.

What does this mean?

OpenDNS is logging information on all DNS requests received by their servers. They log the IP address that initiated each request. Thus, OpenDNS knows and stores the fact that at 11:10PM on Friday the 22nd of June, someone at the network address of some-user-in-washington-dc.comcast.com visited www.thepiratebay.org

OpenDNS logs data on every single unique domain name that you visit. They know that you have visited www.ilikeburritos.com and sometimes.ilikeburritos.com, but they don't have any info on which specific webpages in those domains you visit. This is still a huge amount of information - more, possibly, than Google knows.

OpenDNS keeps this information for a "temporary," yet undefined period of time. Unlike Google, who promise to anonymize the data after a set period of time, it does not look like OpenDNS makes any attempt to anonymize any of their logs.

It does not look like OpenDNS has any kind of public log deletion policy, and thus they could still be storing log data years after the queries were sent to their servers.

This information could be requested by law enforcement, the RIAA, or an angry spouse in a divorce case. These would all be legal instances in which the courts could compel OpenDNS to reveal data on customers. The only way to avoid having 8-year-old DNS requests showing up in a custody dispute would be for OpenDNS to announce and enforce a data logging and log deletion policy.
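To make the request concrete, here is a minimal Python sketch of what such a policy might look like in code. Everything in it is an assumption on my part: the 7-day and 90-day thresholds, the (timestamp, IP, domain) record layout, and the function names are illustrative, and say nothing about how OpenDNS actually stores its logs.

```python
from datetime import datetime, timedelta
import ipaddress

ANONYMIZE_AFTER = timedelta(days=7)   # assumed threshold, not a real policy
DELETE_AFTER = timedelta(days=90)     # assumed threshold, not a real policy

def anonymize_ip(ip):
    """Zero the host bits: keep a /24 for IPv4 and a /48 for IPv6."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)

def apply_policy(records, now):
    """Drop old records and blur the IP address on middle-aged ones."""
    kept = []
    for timestamp, ip, domain in records:
        age = now - timestamp
        if age > DELETE_AFTER:
            continue  # too old: delete outright
        if age > ANONYMIZE_AFTER:
            ip = anonymize_ip(ip)
        kept.append((timestamp, ip, domain))
    return kept

now = datetime(2007, 6, 30)
records = [
    (datetime(2007, 6, 29), "192.0.2.77", "www.example.com"),
    (datetime(2007, 6, 10), "192.0.2.77", "www.example.org"),
    (datetime(2007, 1, 1), "192.0.2.77", "old.example.net"),
]
result = apply_policy(records, now)
print(result)
```

Truncating an address is a weak form of anonymization, since the remaining prefix still narrows things down to a small network, but it is better than keeping full addresses forever.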

What can you do?

While OpenDNS is not perfect, they are probably still better than your average mega-corporate ISP. Some ISPs already seem to be selling data on which websites customers visit. Likewise, AT&T has quite thoroughly sold its customers out to the RIAA and MPAA.

Instead, the best thing to do is to write to Dave Ulevitch/OpenDNS (david [at]opendns [dot] com) and ask him to revise/create a data deletion and anonymization policy.

Sunday, June 17, 2007

An Emotional Blackmail Takedown: Remove The Podcast, Or We Shoot This Puppy

On Thursday of last week, I announced unofficial podcasts for the radio show This American Life. My podcast feed simply provided a deep link to the individual mp3 files on the TAL website, enabling listeners to podcast more than the one most recent episode allowed via the official podcast.

After the FBI raided my house last year, it's fair to say that I've become a little bit more cautious. It's not to say that I'm not pursuing the same kinds of projects, it's just that I find out what my legal risks are before I go public. One interesting project that I've been working on has been stalled for the last few months, as my professor and I wait for a sign-off from both the Indiana University counsel's office, as well as a pro-bono outside counsel the nice folks at the EFF were able to put me in touch with. In that project, there are a number of uncertain legal risks, and it may upset some very powerful people - hence the caution.

Which is why, with this unofficial podcast, I made sure to check out my legal options before I put it online. Trespass to chattel - No problem. Copyright infringement - No problem. Deep linking - Probably no problem.

In the event that I got a proper takedown letter written by a lawyer, I felt that I was on really solid ground. What I did not plan for was an emotional blackmail takedown that made me feel guilty. In hindsight, I suppose I should have predicted it - as it was the same reasonable, non-heavy-handed approach TAL took last year when two other guys set up podcasts.

The message they've given me is this: If you don't remove the podcast, we'll have to spend our limited resources (including the $20 that you donated last week) to pay lawyers to harass you.

I love This American Life. I look forward to a new episode every week, and I don't want to do anything that causes them to pull financial resources away from production.

TAL recently had a podcasting fund-drive, to pay for the $108k in yearly bandwidth costs for their approx 300,000 weekly downloads. In less than three weeks, they raised over $110k - solely by asking for money on their website, and in a request added to the weekly podcast.

Personally, I think that spending over $100k of listener donated money on bandwidth is an almost criminal waste of funds, when archive.org (who also provides free podcasting for Democracy Now) is more than willing to provide free bandwidth.

That massive waste of financial resources is, sadly, not under my control. What is under my control, on the other hand, is whether TAL will have to spend several thousand dollars on legal bills - only to probably find out that everything that I've done is above board. This additional waste of resources is not something I would want to shoulder the responsibility for. In addition to wanting to do the right thing - both my girlfriend and my best friend are also fans of TAL, and I'm guessing that they'd give me a good kicking were I not to back down on this one.

Furthermore, I'll be traveling to and from the Privacy Enhancing Technologies workshop in Ottawa, Canada for the next 6 days. I'm really not comfortable with the idea that I'll be passing through US Customs + TSA with the possibility of a cease and desist sent by a US government funded group hanging over my head.

So - effective immediately, the podcast feeds come down. However, given its effectiveness, and lack of involvement of lawyers, I'm posting the letter I received from Daniel Ash at Chicago Public Radio. I hope that it will perhaps serve as an example to other, more litigation trigger happy organizations. Although, somehow I suspect that an appeal to conscience may not be as effective when the group has been voted the worst company in America.

Christopher,

First of all, thanks for your recent donation to support our bandwidth costs for This American Life. It helped make our online pledge drive a great success.

I am also writing because it has come to our attention that you have set up unofficial, “takedown resistant” podcasts of This American Life. We kindly request that you end this practice immediately.

On your blog, you go into impressive detail outlining the gray areas of the law in which you have ensconced your podcasts. Rather than first turning to our lawyers (at a high cost to our member-supported public radio station) to request that they look into the legality of what you’re doing, we’d like to ask nicely for your cooperation. And even if it turns out that you have found some podcasting equivalent to an off-shore tax shelter, we would still request that you stop. Here’s why:

As you mention, radio must adapt to the digital age. Our content is no longer tied to a single delivery system—that old-school box on your kitchen counter. Now, a number of much smaller, new-skool boxes enable you to take media content wherever you like and to consume it whenever you want.

Adapting to these changes is not always a smooth process. As you noted, we’ve been working hard to find the right digital delivery model for This American Life. For years, we did indeed have an exclusive contract with Audible that prevented us from offering a free podcast. Better late than never, we finally renegotiated that deal and launched a free podcast. Yes, it does have some limitations, chiefly that only one episode—our most recent broadcast—is available at any time. But once you’ve downloaded it, it’s yours to keep. And our archives, more than 300 episodes that span 13 years, are available for purchase at just $0.95 an episode.

In your blog, you suggest a different business model:

"The nicer, and smarter approach (in my humble opinion) would be to ditch the paid podcasting model, allow other organizations to host the TAL podcasts, and thus do away with that nasty bandwidth bill. In three weeks of fundraising, they were able to raise over $110k - more than enough to cover the costs of their 300,000 podcast downloads per week. If bandwidth were provided by others for free, this money could instead go towards TAL's other operating costs - and thus make up for the loss of the iTunes/audible.com revenue stream."

It’s not a bad idea—and one that we’re considering as the media industry as a whole works toward getting better metrics. We’re also watching advances in peer-to-peer technology; at some point, that might be a more plausible alternative for reliable delivery of our content. But these decisions aren’t yours to make.

We want to share This American Life with as many people as possible. But it’s also a very expensive show to produce. In order to offset our costs, we work to attract sponsorships and to develop business partners. Our current contracts are premised on a specific distribution model: namely, a weekly radio broadcast and free podcast, along with a minimally-priced back catalog. Not only do your podcasts make an end-run around this model, but they have the potential to disturb the already-muddy waters of measuring how many people download and listen to our files.

To recap, this is not a cease-and-desist kind of letter. No lawyers were consulted, and we hope there’s no need to involve them. This is simply a request. We acknowledge that our current business model isn’t perfect. But you have to admit that it’s a whole lot better (and to use your words, “nicer and smarter”) than it was 18 months ago. We want to be nice and smart in our practices, and we intend to continue in that direction as we move forward. But for now, this is where we stand, and it’s not going to change in the immediate future. We won’t ask you to stop what you’re doing for your own, personal enjoyment of the show; but hope you understand the reasoning behind our request that you take down the podcast feeds.

Thanks for your consideration.

Daniel

Daniel O. Ash | Vice President | Strategic Communications | Chicago Public Radio

Friday, June 15, 2007

Can The US government Infringe on Trademark

The Transportation Security Administration has set up a new page on their website to react to the recent story about a woman being detained at DCA Reagan Airport for spilling her toddler's "sippy" cup. I'm not going to discuss the specifics of that case, as it's not important.

What is interesting, is that the new section of TSA's website is called "MythBusters". As many of you may know, MythBusters is the name of a hugely popular TV show on the Discovery Channel.

It turns out that Discovery Communications has registered the term MythBusters with the US Trademark office as: Entertainment services in the nature of non-fiction television programming featuring examination of popularly held beliefs and misinformation; information regarding same provided via a global computer network.

Trademark isn't really an area of IP law that I know too much about. My (limited) expertise thus far is mainly in copyright - as it is the area that I'm most likely to get myself in trouble with.

I know that the US government has carved itself a number of exceptions in other areas of the law. For example, the US government cannot commit the tort of "patent infringement" - and instead, merely has to pay "reasonable and entire compensation" to a patent owner for the unauthorized use of a patent.

So - I pose the question to those out there who know more about the law than me:

Can TSA name a section of their website MythBusters (note, even the same capitalization as the TV show) without breaking any laws? TSA's new website provides video footage that examines popularly held beliefs and misinformation, which is then delivered via a global computer network (the Internet).

Is this legit, or did TSA fail to run this by their in-house counsel?

Thursday, June 14, 2007

A Takedown Resistant, Unofficial Podcast Feed for "This American Life"

Update Sept 28, 2008:

I get a fair number of hits to this page from people looking for mp3 copies of This American Life episodes.

The short version, is that you can get each episode here: http://audio.thisamericanlife.org/jomamashouse/ismymamashouse/EPISODENUMBER.mp3

The longer explanation can be found here: New location for This American Life Mp3s

The podcast feed that I created, as described in this blog post, was taken down due to a letter sent by This American Life. To read it, go here: An Emotional Blackmail Takedown: Remove The Podcast, Or We Shoot This Puppy

Introduction

I'm a big fan of the Chicago Public Radio show This American Life (TAL), which is broadcast on many US National Public Radio affiliates nationwide. For some reason, the public radio and TV business (in the US and elsewhere) has not yet figured out this whole Internet thing, and thus many of their online/podcast offerings are less than stellar.

Many young people do not listen to radio anymore. Even when I'm in the US - simply put - if This American Life, Democracy Now, Open Source, Radio Lab and other fantastic radio shows were not available as podcasts - there is no way I'd listen to them. I cannot concentrate on work while listening to an informative radio show, and thus for the most part, streaming them on my computer is just not practical. Instead, they are perfect for the tram ride/walk to work, airplane/bus trips, or walks in the park.

TAL's current business/podcasting model is as follows:

The latest episode can be downloaded through a podcast feed for free from their website.

Archived episodes can be streamed online (via a flash player) on their website but not downloaded.

Listeners wishing to podcast archived episodes must buy them online - from either Apple's iTunes store, or audible.com for 99 cents.

This is sub-optimal for many reasons - the most important of which is that I have no desire to enrich Steve Jobs. When I give money to TAL, it is through a direct donation on their website. I do not wish for Apple, or any other middleman to take 30-50%.

TAL has clearly been struggling to find a business plan that works - and is probably contractually bound by their recently renegotiated contract with Audible.com to not allow their back-catalog to be podcasted for free. As Current.org noted, "[t]he barrier to podcasting for years was the longstanding Audible deal. The vendor sold episodes of TAL for $3.95 a pop, barring the show from offering free downloads."

TAL has also recently had a podcasting fund-drive to pay for their yearly bandwidth bill of $108k. I've given them $20, but I can't figure out why they don't just put all of their content on archive.org (as Democracy Now has done), and thus do away with their bandwidth bill completely. Likewise, TAL is extremely popular amongst many tech geek circles - I am sure that a company or two would step up to the plate and give them free bandwidth if they asked.

I suspect that their unwillingness to let others host their media comes from their desire to somehow find a way to preserve their audible.com/iTunes deal. If they are making more than $108k through paid downloads, then perhaps this is a wise choice. The nicer, and smarter approach (in my humble opinion) would be to ditch the paid podcasting model, allow other organizations to host the TAL podcasts, and thus do away with that nasty bandwidth bill. In three weeks of fundraising, they were able to raise over $110k - more than enough to cover the costs of their 300,000 podcast downloads per week. If bandwidth were provided by others for free, this money could instead go towards TAL's other operating costs - and thus make up for the loss of the iTunes/audible.com revenue stream.

When This American Life first moved from Realaudio to MP3 storage of their show archive, the decision seems to be primarily due to issues with Realplayer - and not because they wished to enable listeners to save copies of shows to their machines. While many listeners petitioned TAL for a podcast feed, the show resisted.

Due to the fact that the episodes were being stored as MP3s, it was quite possible for users to download them to their own computers, and for others to create a podcast RSS feed.

In 2006, Jared Benedict and Jon Udell did just this, and made unofficial This American Life podcast feeds. Soon, links appeared in popular websites such as BoingBoing, and shortly after, the TAL webmaster sent a polite takedown request to both gentlemen. Jared sums it up by stating, "we received friendly emails from Ms. Meister, This American Life’s webmaster, making a request to take down the hyperlinks and RSS feeds, or she’d regrettably have to get lawyers involved. While Ms. Meister did miss the mark by accusing us of copyright infringement without a clear understanding of what we were actually doing, or what copyright law allows, she was trying to be polite and friendly which I appreciate."

Subsequently, both men took down their podcast feeds.

Last year, I discovered that TAL has all of their mp3s in one directory, amusingly located at http://audio.thisamericanlife.org/jomamashouse/ismymamashouse/episodenum.mp3. Clearly, their webmaster has a sense of humor.

Unfortunately, it is not possible to give iTunes/other podcast clients a directory to download mp3 files from, so a podcast RSS feed is necessary. TAL's official podcast feed only has the most current episode available for download - which doesn't help me when I miss a week, or if I want to load up my iPod for a long journey.
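For readers wondering what such a feed involves, here is a minimal Python sketch that builds an RSS 2.0 document whose items carry enclosure elements pointing at the episode URL pattern described above. It is not the script I actually used: the channel title, link, and episode titles are placeholders, the build_feed function is a name I made up, and the length attribute is set to 0 because the sketch does not know the real file sizes.

```python
import xml.etree.ElementTree as ET

# URL pattern as described in this post; {num} is the episode number.
BASE_URL = "http://audio.thisamericanlife.org/jomamashouse/ismymamashouse/{num}.mp3"

def build_feed(episodes):
    """Build an RSS 2.0 feed from (episode_number, title) pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Unofficial feed (sketch)"
    ET.SubElement(channel, "link").text = "http://example.com/"  # placeholder
    ET.SubElement(channel, "description").text = "Illustrative feed only"
    for num, title in episodes:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # Podcast clients download whatever the enclosure URL points at.
        ET.SubElement(item, "enclosure", url=BASE_URL.format(num=num),
                      type="audio/mpeg", length="0")
    return ET.tostring(rss, encoding="unicode")

feed = build_feed([(1, "Episode 1"), (2, "Episode 2")])  # placeholder titles
print(feed)
```

The feed contains only links and metadata; the audio itself is never copied, which is what the deep linking discussion below turns on.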

I've learned enough about Internet law at this point to know that website owners are within their rights to get angry if you scrape their website on a regular basis. This comes down to a tort claim of trespass to chattels.

While TAL does provide a podcast RSS feed that I could scrape, I thought it best to avoid downloading data from there, if just to provide a bit of legal distance, and avoid any claim that I was hammering their servers.

Yesterday, I announced and made available a script that pulls data from Google's cache of popular RSS feeds. By using that script, I can be sure that I am not directly accessing TAL's webservers, nor am I increasing the load on their servers (assuming of course, that at least a couple of other people are using Google Reader to keep up to date with TAL episodes).

Regarding copyright - The content of the episodes, show logos, etc are protected by copyright law. However, episode names, broadcast dates and other info are not. It is for this same reason that one cannot copyright tv schedule listings.

The issue of deep linking (i.e. providing a direct link to an mp3 file on TAL's website) is a legal grey area. The case law here is by no means solid. A few years back, a California judge ruled that, "hyperlinking does not itself involve a violation of the Copyright Act" because "no copying is involved." I am fairly confident that I am on firm ground - since, after all, the content I am linking to is perfectly legal (thus the DeCSS decision shouldn't apply) and accessible by any user who visits the TAL website and uses their embedded flash mp3 player.

With those legal issues dealt with for the most part, I decided to roll my own.

I've set up two unofficial podcast feeds for This American Life.

The vanilla feed, which lists the 15 most recent episodes as broadcast by TAL.

A feed of only new episodes, which does not include the reruns as broadcast by TAL.

Please note that this was not done with the permission or consent of NPR, PRI, or the fine folks at This American Life. Show your support, go to the TAL website, and donate 20 bucks.

Wednesday, June 13, 2007

Coding Around Trespass To Chattel with Google Feeds API

Begin Update July 27 2012:

The day after I published this post in 2007, I published a podcast feed for the NPR radio show This American Life, which used the Google Cache method to build the RSS feed (TAL only makes the most recent couple of shows available to the public for free via its podcast, not older shows).

Soon after, I took that podcast feed down after I was directly contacted by someone working for the radio show.

Fast forward five years, and Craigslist is now suing a company called 3Taps for using a method which appears to be surprisingly similar to the technique I described in 2007.

As in 2007, the legal issues surrounding this technique are unclear at best. Eric DeMenthon, the founder of PadMapper, a company using 3Taps data that was also sued by Craigslist, told Ars Technica that "Since I'm not actually re-posting the content of the listings, just the facts about the listings, I figured (with legal advice) that there was no real copyright issue there."

Perhaps the courts will finally resolve this interesting legal question.

End update

Check out the RSS/XML cache fetcher here.

Many popular websites now provide RSS/XML feeds for all kinds of data relating to their website: Digg headlines, CNN news, Craigslist for-sale items, and millions of blogs.

What happens when you want to build on that data in a way that the website owner didn't plan for, and probably won't approve of?

Since they're providing an RSS feed to the general public, it's unlikely that they can stop you from downloading it. Depending on how you are using the data, they may or may not be able to use copyright law against you. However, they may be able to claim that you're hammering their servers, and thus taking up their resources. In such a situation, normally, if they tell you to stop and you continue, you could be in trouble. This comes under a claim of trespass to chattel - an arcane area of the law that has been used by companies such as eBay and American Airlines to stop people from scraping their websites.

Google has a very popular online RSS reader service, unsurprisingly named Google Reader. Instead of requesting any given website's RSS feed every time one of their millions of users wishes to view this feed in their browser, Google instead serves the user a cached copy. Google's feed crawler ("Feedfetcher") software retrieves feeds from most sites less than once every hour. Some frequently updated sites may be refreshed more often.

The benefit to website owners is clear: Instead of getting hit once every hour by hundreds of thousands of people, they get visited maybe once every hour by Google's software, and Google then deals with the bandwidth issues involved in getting that data to its users.

Google is even nice enough to modify the user agent string that the feed crawler sends to webservers, to tell website owners exactly how many people have subscribed to the feed.
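For a website owner reading their own logs, pulling that subscriber count out of the user agent string might look something like the sketch below. The sample string and its exact layout (a "Feedfetcher-Google" prefix followed by "N subscribers") are assumptions made for illustration, and feedfetcher_subscribers is a made-up helper name, so check it against what actually shows up in your logs.

```ruby
# Extract the subscriber count from a Feedfetcher-style user agent string.
# The string format matched here is an assumption, not a documented contract.
def feedfetcher_subscribers(user_agent)
  return nil unless user_agent.to_s.include?('Feedfetcher-Google')

  match = user_agent.match(/(\d+)\s+subscribers?/)
  match ? match[1].to_i : nil
end

# Hypothetical log entry, for illustration only.
sample = 'Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; ' \
         '42 subscribers; feed-id=1234567890)'
puts feedfetcher_subscribers(sample)
```

The helper returns nil for any user agent that does not look like the feed crawler, so it can be run over a whole log without special-casing other visitors.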

If you are logged into a Google service (such as Gmail), you can view Google's cached copy of any RSS feed by going to: http://www.google.com/reader/atom/feed/the_feed_you_want

For example, the This American Life podcast feed can be seen by going to: http://www.google.com/reader/atom/feed/http://feeds.thisamericanlife.org/talpodcast
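The URL pattern is just a fixed prefix with the feed address appended, which can be sketched in a couple of lines of Ruby. The helper name reader_cache_url is made up for this example, and it appends the feed URL verbatim, exactly as in the example above; whether feed URLs containing query strings need extra encoding is not something I have checked.

```ruby
# Prefix taken from the URL pattern described above.
READER_CACHE_PREFIX = 'http://www.google.com/reader/atom/feed/'

# Build the cached-feed URL by appending the feed address as-is.
def reader_cache_url(feed_url)
  READER_CACHE_PREFIX + feed_url
end

puts reader_cache_url('http://feeds.thisamericanlife.org/talpodcast')
```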

Unfortunately, this cannot be automated with a script - as it requires that you log in to a Google Account first. While this can be automated somewhat using the Google-provided ClientLogin/AuthSub mechanisms, it would still mean that you'd be scraping the query results. ClientLogin (I believe) also requires that you solve a CAPTCHA, which isn't going to work for a Ruby script running on a remote server. Google's terms of service forbid users from automating and scraping data returned from Google queries. If you want to get data from Google with a script, you need to use one of their APIs.

Luckily - Google recently announced an AJAX Feed API that permits developers to embed data from a cached RSS feed in their websites. The new Feed API allows you to put a bit of JavaScript on your website - which will then automatically display the contents of an RSS feed. That's great, but it's not quite what I'm after, as I want to do this from a Ruby/Perl script, and not from within a JavaScript webpage.

Niall Kennedy reverse engineered the API a bit, and figured out how to get a JSON encoded version of a cached RSS feed from Google's servers. When Niall first started looking into the guts of the Reader and its APIs late last year, members of the Google Reader development team left comments on his blog commending him on his reverse engineering, and provided key bits of information. Thus, while the release of this information isn't 100% sanctioned by Google, it's fair to assume that Google is aware it is out there. Most importantly, this method uses the API (and requires an API key that you can request from Google) - which means that in using it, I'm safer and more legit.

Thus, I whipped up a Ruby script that will parse the JSON output from Google, and give you a real RSS feed. A live demo and the source code of the RSS/XML cache fetcher are available here.
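To give a feel for the JSON-to-RSS step, here is a minimal sketch. It is not the script linked above. The JSON layout it expects (a top-level "feed" object with "title", "link", "description" and an "entries" array) is a hypothetical shape chosen for illustration, not the actual format Google returns, and json_to_rss and xml_escape are names invented for this example. Fetching the JSON and passing the API key are left out entirely.

```ruby
require 'json'

# Characters that must be escaped inside XML text nodes.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Convert a JSON feed document (hypothetical shape, see above) to RSS 2.0 text.
def json_to_rss(json_text)
  feed = JSON.parse(json_text).fetch('feed')
  items = feed.fetch('entries', []).map do |entry|
    <<~ITEM
      <item>
        <title>#{xml_escape(entry['title'])}</title>
        <link>#{xml_escape(entry['link'])}</link>
        <description>#{xml_escape(entry['content'])}</description>
      </item>
    ITEM
  end

  <<~RSS
    <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0">
    <channel>
    <title>#{xml_escape(feed['title'])}</title>
    <link>#{xml_escape(feed['link'])}</link>
    <description>#{xml_escape(feed['description'])}</description>
    #{items.join}</channel>
    </rss>
  RSS
end

# Made-up sample input, for illustration only.
sample = <<~SAMPLE_JSON
  {"feed": {"title": "Example Show", "link": "http://example.com/",
            "description": "Demo feed",
            "entries": [{"title": "Episode 1 & 2",
                         "link": "http://example.com/ep1.mp3",
                         "content": "First <b>episode</b>"}]}}
SAMPLE_JSON
puts json_to_rss(sample)
```

Escaping every field by hand keeps the sketch free of extra libraries. A real podcast feed would also need enclosure elements pointing at the mp3 files, which this sketch does not produce.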

Why is this useful? It means that you can scrape someone's RSS/XML feed without ever going to their website. While I'm no lawyer (and this is certainly not legal advice) - I'd imagine that it'd be fairly difficult for anyone to attempt a trespass to chattel claim, since they'll be unable to prove any harm, or consumption of resources.

Google has millions of customers. It's quite likely that their bot is already crawling most popular RSS feeds. Thus, it's very unlikely that pulling up a copy of the feed will cause Google to go and fetch a new copy.

Many websites already provide different (non-password-protected) data to Google than they do to anyone else visiting their site. It may be possible, by using Google's RSS cache, to take advantage of this architecture flaw/design decision, and access data that one wouldn't normally be able to get.

Finally, since you won't be hitting anyone's webservers, there is no link (at least in their weblogs) between you and them. They have no way of knowing how often you're accessing their feed. You're hidden amongst the millions of Google users.