We all know that data security is essential, but are we doing enough to protect our data feeds?
In 2006, Netflix launched a competition to improve the accuracy of its recommendation algorithm, releasing an anonymized dataset of around 100 million movie ratings from roughly 480,000 subscribers. The following year, two researchers at the University of Texas, Arvind Narayanan and Vitaly Shmatikov, published a paper about the Netflix Prize dataset. They had examined the data and developed a methodology to break its anonymization.
They could take the anonymized records and trace them back to individuals with astonishing accuracy. Their work led to a class action lawsuit against Netflix for breach of privacy, including a high-profile case of an "in the closet" lesbian who claimed to have been "outed" by the Netflix dataset.
Ten years later, the UK government conducted trials that anonymously tracked the movement of passengers through the London Underground by measuring WiFi connections. Throughout Europe, government agencies are required by law to release information requested by citizens, but the government has refused to release this data on the basis that its anonymization is not secure and individuals could still be reidentified from it.
If both Netflix and the UK government are having difficulty in anonymizing their datasets, there must be something going on.
So what is the problem with so-called anonymous data?
The principle of data anonymization is to take anything personally identifiable, for example, your phone number or your email address, and replace it with an anonymous token.
You might initially think this makes the data secure, but by combining the dataset with other datasets, it's relatively straightforward to reidentify individuals. Suppose I have licensed a dataset that lists the websites that anonymous cookies have visited. I might have bought this dataset to understand which sites make a better target for my online campaign, or to retarget visitors from another service when they visit mine.
Now when I receive this data, it doesn't contain any personally identifiable information (PII), but if I have a login on my website, then I can match a visitor to my site directly to the feed. All I need to do is match the date and time of an anonymous user's actions in the data feed with the date and time of activities on my website, and I've got a match. I've now reidentified the consumer and have access to all of the other sites they've visited from the cookie feed.
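The timestamp matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feed layout, the tokens, the site names, the email addresses and the two-second tolerance are all hypothetical, and a real feed would need more care with clock drift and multiple candidates per time window.

```python
from datetime import datetime, timedelta

# Hypothetical "anonymous" cookie feed: (token, site, timestamp).
feed = [
    ("tok_a1", "news.example", datetime(2024, 3, 1, 9, 0, 5)),
    ("tok_a1", "myshop.example", datetime(2024, 3, 1, 9, 2, 11)),
    ("tok_a1", "health.example", datetime(2024, 3, 1, 9, 30, 0)),
    ("tok_b2", "myshop.example", datetime(2024, 3, 1, 14, 45, 0)),
    ("tok_b2", "sports.example", datetime(2024, 3, 1, 15, 0, 0)),
]

# Hypothetical login log from my own site: (user, timestamp).
logins = [
    ("alice@example.com", datetime(2024, 3, 1, 9, 2, 12)),
    ("bob@example.com", datetime(2024, 3, 1, 14, 45, 1)),
]


def reidentify(feed, logins, my_site, tolerance=timedelta(seconds=2)):
    """Map a token to a known user when a feed visit to my_site
    falls within `tolerance` of one of my own login events."""
    matches = {}
    for token, site, seen_at in feed:
        if site != my_site:
            continue
        for user, logged_at in logins:
            if abs(seen_at - logged_at) <= tolerance:
                matches[token] = user
    return matches


def browsing_history(feed, matches):
    """Collect every site in the feed for each reidentified user."""
    history = {}
    for token, site, _ in feed:
        if token in matches:
            history.setdefault(matches[token], []).append(site)
    return history


matches = reidentify(feed, logins, "myshop.example")
history = browsing_history(feed, matches)
```

A single coinciding event on my own site is enough to attach a name to a token, and everything else that token did in the feed comes along with it.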
Why is this a big issue?
The biggest problem with reidentification is that PII is regulated differently to other data. The serial number on the engine of my car can be stored in multiple databases and freely exchanged between third parties by my auto provider. My name and address, by contrast, are tightly regulated: they are treated as my personal property and cannot be shared between third parties without my explicit consent.
Furthermore, when people permit sharing of their personal information, it comes with restrictions of purpose. For example, you might provide your financial information to a bank as part of a mortgage application, but you are not permitting the bank to sell the same information to third parties to target you with better credit card deals.
If we are sharing data between parties and we cannot prevent reidentification, then we cannot guarantee the security of our consumers' data, or that the uses of the data will comply with the terms and conditions we agreed with our consumers when we collected it.
So what can we do about it?
The safest way to prevent reidentification is to follow the example of the UK government and not share any information with third parties. This does, however, restrict not only the consumer experience but also many legitimate business opportunities.
More commonly, businesses will share data but take contractual and technical precautions to prevent reidentification. The two technical approaches to avoiding reidentification are regular re-anonymization and building a service on top of the raw data.
The first of these is to change the anonymous tokens for each consumer on a routine basis. This limits the ability of third parties to track individuals over time, which significantly diminishes the ability to reidentify consumers. You might be able to follow that someone has visited a specific site once, but you cannot then track what they are doing the next day.
However, as well as reducing the risk of reidentification, this also reduces the usefulness of the data from a business perspective. If you can only see what a consumer does within a single day, the data is much less valuable than the full picture of their activities over a year.
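As a rough illustration of routine token rotation, the sketch below derives each token from the user identifier, the calendar day and a secret key using an HMAC. The daily rotation period, the function name `daily_token`, the key and the 16-character truncation are assumptions made for the example, not a description of any particular product.

```python
import hashlib
import hmac
from datetime import date


def daily_token(user_id: str, day: date, secret: bytes) -> str:
    """Derive a token that is stable within one day and changes the next.

    Without the secret, a recipient of the feed cannot link the
    tokens issued for the same user on two different days.
    """
    message = f"{day.isoformat()}|{user_id}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:16]


# Hypothetical key: it stays with the data owner and never ships in the feed.
SECRET = b"replace-with-a-real-secret"

monday = daily_token("alice@example.com", date(2024, 3, 4), SECRET)
monday_again = daily_token("alice@example.com", date(2024, 3, 4), SECRET)
tuesday = daily_token("alice@example.com", date(2024, 3, 5), SECRET)
```

Here `monday` and `monday_again` are identical, so activity within a day can still be joined, while `tuesday` is unrelated to either. That is exactly the trade-off between safety and usefulness described above.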
The only way to reduce the chances of reidentification further is to provide a service rather than a data feed. I've written previously about the perils of asking for the raw data. When it comes to data security, offering an API to query information on aggregated subsets of consumers, or even on individual consumers in real time, is much more secure than providing a detailed data feed to your subscribers.
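To make the idea of a service over the raw data concrete, here is a minimal sketch of an aggregate query that refuses to answer for small groups. The function `visitor_count`, the sample visits and the threshold of three are all hypothetical, and suppressing small groups lowers the risk rather than removing it, since carefully chosen overlapping queries can still leak information.

```python
MIN_GROUP_SIZE = 3  # hypothetical threshold below which no answer is given

# The raw (token, site) records stay inside the service and are never exported.
_VISITS = [
    ("tok_a1", "news.example"),
    ("tok_b2", "news.example"),
    ("tok_c3", "news.example"),
    ("tok_a1", "news.example"),  # repeat visit, counted once
    ("tok_a1", "health.example"),
    ("tok_b2", "sports.example"),
    ("tok_c3", "sports.example"),
]


def visitor_count(site):
    """Return the number of distinct visitors to `site`, or None when
    the group is too small to report without singling people out."""
    visitors = {token for token, visited in _VISITS if visited == site}
    if len(visitors) < MIN_GROUP_SIZE:
        return None
    return len(visitors)


news_visitors = visitor_count("news.example")
health_visitors = visitor_count("health.example")
```

Subscribers get the answer to their business question (how many people visit a site) without ever holding the row-level records they would need to match against their own logs.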
The bottom line
At Deductive, our view is that all information relating to subscribers is potentially reidentifiable. All data is therefore potentially PII and needs the same level of security as PII.
The Netflix class action was finally settled in 2011 for $9m. That might be a drop in the ocean for Netflix financially, but reputationally no one can afford to play fast and loose with their consumers' data.
Contractually, you need to prohibit third parties from reidentifying consumers, and only share data with those you can trust to act accordingly. If your business relies upon sharing more broadly, we recommend developing an API or a data-driven service rather than providing a raw data feed.