Russell's Blog

New. Improved. Stays crunchy in milk.

What Google knows

Posted by Russell on April 28, 2010 at 11:26 a.m.
After six months of using Google Latitude, I've amassed 7108 location updates, or about 38 a day. It would probably be a lot more if I hadn't managed on occasion to break the GPS or automatic updating by fiddling with the software.

It's actually quite useful to have this data, especially if it's correlated with some richer information. For example, I've consulted the data to answer questions like, "Where was that awesome sandwich place I ate at last month?" It's also extremely useful to be able to share this data with Google because it allows me to quickly cross-reference location coordinates with Google's database of businesses and addresses. You can also download your complete location history in one giant blob (just ignore the warning that the History map only displays 500 datapoints, and download the KML file). Once you have the KML file, you can do whatever you want with it. For example, I uploaded mine to Indiemapper to map my wanderings for the last six months (Indiemapper is cool, but I quickly found that this dataset is really much too big for a Flash-based web application).
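As a rough illustration of "whatever you want", here is a minimal Python sketch that pulls the points out of a KML file and computes their bounding box. It assumes the export stores positions in standard KML `<coordinates>` elements (longitude,latitude[,altitude] tuples separated by whitespace); I haven't confirmed that this is the exact layout of the Latitude export, so treat the element name as an assumption. The function names and the sample document are made up for the example.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a downloaded history file (hypothetical data).
SAMPLE_KML = """<?xml version="1.0"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><Point><coordinates>-121.74,38.54,0</coordinates></Point></Placemark>
    <Placemark><LineString>
      <coordinates>-122.27,37.87,0 -118.24,34.05,0</coordinates>
    </LineString></Placemark>
  </Document>
</kml>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def kml_points(kml_text):
    """Return a list of (lon, lat) pairs from every <coordinates> element."""
    root = ET.fromstring(kml_text.strip())
    points = []
    for element in root.iter():
        if local_name(element.tag) == "coordinates" and element.text:
            # Tuples are whitespace-separated; fields are comma-separated.
            for tup in element.text.split():
                fields = tup.split(",")
                points.append((float(fields[0]), float(fields[1])))
    return points


def bounding_box(points):
    """Return (min_lon, min_lat, max_lon, max_lat) for a list of points."""
    lons = [lon for lon, _ in points]
    lats = [lat for _, lat in points]
    return (min(lons), min(lats), max(lons), max(lats))


if __name__ == "__main__":
    pts = kml_points(SAMPLE_KML)
    print(len(pts), "points, bounding box:", bounding_box(pts))
```

To run it on a real download, read the file's contents into a string and pass that to `kml_points` instead of `SAMPLE_KML`.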

Not surprisingly, I spent most of my time in California, mostly in Davis and the Bay Area, with a few trips to Los Angeles via I-5, the Coast Starlight, and the San Joaquin (the density of points along those routes is indicative of the data service along the way).

The national map shows my trip to visit my dad's family in New Jersey and Massachusetts, as well as a layover in Denver that I'd completely forgotten about.

I have somewhat mixed feelings about this dataset. On one hand, it's very useful to have, and so is sharing it with my friends and with Google. It's also cool to have this sort of quantitative insight into my recent past so easily accessible. On the other hand, I'm not particularly happy with the idea that Google controls this data. I chose the word controls deliberately. I don't mind that they have the data -- after all, I did give it to them. As far as I know, Google has been a good citizen when it comes to keeping personal location data confidential. The Latitude documentation makes their policy pretty clear:

Privacy

Google Location History is an opt-in feature that you must explicitly enable for the Google Account you use with Google Latitude. Until you opt in to Location History, no Latitude location history beyond your most recently updated location if you aren't hiding is stored for your account. Your location history can only be viewed when you're signed in to your Google Account.

You may delete your location history by individual location, date range, or entire history. Keep in mind that disabling Location History will stop storing your locations from that point forward but will not remove existing history already stored for your Google Account.

...

If I delete my history, does Google keep a copy or can I recover it?

No. When you delete any part of your location history, it is deleted completely and permanently within 24 hours. Neither you nor Google can recover your deleted location history.

So, that's what they'll do with it, and I'm happy with that. What bothers me is this: Who owns this data?

This question leads directly to one of the most scorchingly controversial questions you could ask for, and there are profound legal, social, economic and moral outcomes riding on how we answer it. This isn't just about figuring out what coffee shops I like. If you want to see how high the stakes go, buy one of 23andMe's DNA tests. You're giving them access to perhaps the most personal dataset imaginable. In fairness, 23andMe has a very strong confidentiality policy.

But therein lies the problem -- it's a policy. Ambiguous or fungible confidentiality policies are at the heart of an increasing number of lawsuits and public snarls. For example, there is the case of the blood samples taken from the Havasupai Indians for use in diabetes research that turned up in research on schizophrenia. The tribe felt insulted and misled, and sued Arizona State University (the case was recently settled, with the tribe prevailing on practically every item).

You can't mention informed consent and not revisit HeLa, the first immortal human cells known to science. HeLa was cultured from a tissue biopsy from Henrietta Lacks and shared among thousands of researchers -- even sold as a commercial product -- making her and her family among the most studied humans in medical history. The biopsy, the culturing, the sharing and the research all happened without her knowledge or consent, or the knowledge or consent of her family.

And, of course, there is Facebook -- again. Their new "Instant Personalization" feature amounts to sharing information about personal relationships and cultural tastes with commercial partners on an opt-out basis. Unsurprisingly, people are pissed off.

Some types of data are specifically protected by statute. If you hire a lawyer, the data you share with them is protected by attorney-client privilege, and cannot be disclosed even by court order. Conversations with a psychiatrist are legally confidential under all but a handful of specifically described circumstances. Information you disclose to the Census cannot be used for any purpose other than the Census. Nevertheless, there are many types of data that have essentially no statutory confidentiality requirements, and these types of data are becoming more abundant, more detailed, and more valuable.

While I appreciate Google's promises, I'm disturbed that the only thing protecting my data is the goodwill of a company. While a company might be full of a lot of good people, public companies are always punished for altruistic behavior sooner or later. There is always a constituency of assholes among shareholders who believe that the only profitable company is a mean company, and they'll sue to get their way. Managers must be very mindful of this fact as they navigate the ever-changing markets, and so altruistic behavior in a public company can never be relied upon.

We cannot rely on thoughtful policies, ethical researchers or altruistic companies to keep our data under our control. The data we generate in the course of our daily lives is too valuable, and the incentives for abuse are overwhelming. I believe we should go back to the original question -- who owns this data? -- and answer it. The only justifiable answer is that the person described by the data owns the data, and may dictate the terms under which the data may be used.

People who want the data -- advertisers, researchers, statisticians, public servants -- fear that relinquishing their claim on this data will mean that they will lose it. I strongly disagree. I believe that people will share more freely if they know they can change their mind, and that the law will back them up.

Update

The EFF put together a very sad timeline of Facebook's privacy policies as they've evolved from 2005 to now. They conclude, depressingly:
Viewed together, the successive policies tell a clear story. Facebook originally earned its core base of users by offering them simple and powerful controls over their personal information. As Facebook grew larger and became more important, it could have chosen to maintain or improve those controls. Instead, it's slowly but surely helped itself — and its advertising and business partners — to more and more of its users' information, while limiting the users' options to control their own information.

Comcast melts in the rain

Posted by Russell on April 20, 2010 at 10:57 p.m.
For reasons I do not wish to fathom, my internet connection from home sucks whenever it rains. When I try to imagine why this might be the case, it calls to mind some truly horrifying images of what might be going on in Comcast's wiring closets.

How much does it suck? Well, here is a histogram of 200 ping times from my house to a machine at UC Davis, about 3000 feet from my front door. For comparison, I simultaneously collected 200 pings from my colo machine, which is 3000 miles away in Boston. The inbound and outbound packets from the colo go over Level3, so I've labeled it thusly.
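For anyone who wants to make the same kind of comparison, here is a minimal Python sketch that turns saved ping output into a histogram of round-trip times. It assumes the reply lines contain `time=12.3 ms`, as the common Linux and macOS `ping` tools print; other implementations may format this differently. The function names, the 10 ms bin width, and the sample output are all invented for the example, not taken from my actual measurements.

```python
import re
from collections import Counter

# Matches the round-trip time in lines like "... ttl=57 time=12.3 ms".
RTT_PATTERN = re.compile(r"time[=<]([\d.]+)\s*ms")

# Made-up ping output, standing in for a saved capture.
SAMPLE_OUTPUT = """\
64 bytes from 192.0.2.1: icmp_seq=1 ttl=57 time=12.3 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=57 time=18.9 ms
Request timeout for icmp_seq 3
64 bytes from 192.0.2.1: icmp_seq=4 ttl=57 time=250 ms
"""


def round_trip_times(ping_output):
    """Return every round-trip time found in the text, in milliseconds."""
    return [float(m.group(1)) for m in RTT_PATTERN.finditer(ping_output)]


def histogram(values, bin_width):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def render(hist, bin_width):
    """Draw the histogram as rows of '#' characters, one row per bin."""
    rows = []
    for start, count in hist.items():
        rows.append(f"{start:>5}-{start + bin_width:<5} ms | {'#' * count}")
    return "\n".join(rows)


if __name__ == "__main__":
    times = round_trip_times(SAMPLE_OUTPUT)
    print(render(histogram(times, 10), 10))
```

Lost packets never show up as a time, so it's worth comparing the number of parsed times against the number of pings sent as well.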

Now, I wouldn't really expect a residential cable modem connection to measure up very well against a colocated server in terms of latency, but this isn't just a failure to measure up. This is just a regular old-fashioned failure.

What ticks me off the most is that I pay $636 a year for this crap, and that my only alternative is AT&T DSL. I'd rather shave my tongue with a used bayonet than see a penny of my income fall into the hands of AT&T. Why does broadband suck in America?

I believe, Sir, that I may with safety take it for granted that the effect of monopoly generally is to make articles scarce, to make them dear, and to make them bad.
- Thomas Babington Macaulay