The most common way public agencies protect our identities is anonymization. This involves stripping out obviously identifiable things such as names, phone numbers, email addresses, and so on. Data sets are also altered to be less precise, columns in spreadsheets are removed, and “noise” is introduced to the data. Privacy policies reassure us that this means there’s no risk we could be tracked down in the database. However, a new study in Nature Communications suggests this is far from the case. Researchers from Imperial College London and the University of Louvain have created a machine-learning model that estimates exactly how easy individuals are to reidentify from an anonymized data set. You can check your own score here, by entering your zip code, gender, and date of birth.
On average, in the U.S., using those three records, you could be correctly located in an “anonymized” database 81% of the time. Given 15 demographic attributes of someone living in Massachusetts, there’s a 99.98% chance you could find that person in any anonymized database. The tool was created by assembling a database of 210 different data sets from five sources, including the U.S. Census. The researchers fed this data into a machine-learning model, which learned which combinations are more nearly unique and which are less so, and then assigns the probability of correct identification.