Census Bureau blurs data to keep names confidential: Ensuring anonymity is increasingly difficult

In a corner of the U.S. Census Bureau, a small group of statisticians has been sweating out the agency’s nightmare scenario: “re-identification.” That’s the term for a technique the bureau fears could allow marketers and other “intruders” to match anonymous census information with the names of the people who provided it. Such a concern is largely theoretical so far.

But if perfected, the technique could have great appeal to marketers of everything from french fries to financial services.

Knowing, for example, the names and addresses of wealthy people in a certain area who endure a lengthy commute every day would allow marketers of cellphones, car stereos, books on tape and other driving accouterments to home in on likely customers.

Confidentiality is key at the Census Bureau, since almost no one would participate in the great decennial inquisition without it. But ensuring anonymity is increasingly difficult in the age of the Internet and computer databases that contain millions of customer-purchase records and other information. The Census Bureau doesn’t publicize it, but two years ago one of its own statisticians began warning that increasingly powerful computers could make it possible for outsiders to glean personally identified information from census data.

“Existing record-linkage methods are so powerful,” wrote Census Bureau statistician William E. Winkler in 1998, that matching personal data to names is possible “by relatively naive persons using commercially available software.”

In response to such concerns, the Census Bureau is making some significant changes for the 2000 census, including a stepped-up practice of masking the information it releases about individuals.

The bureau for decades has engaged in a little-known technique called “data swapping,” in which a few key pieces of information about one person are switched with those of another person with a similar background living nearby. For example, to mask a data file containing the ages and incomes of six people, researchers would randomly rearrange the income levels so that within one census block, a 21-year-old originally listed as making $20,000 is now listed as making $15,000, while a 50-year-old making $15,000 is now listed as making $20,000. The process allows researchers to continue to draw valid observations from the file, since the swapping doesn’t change the totals for each data column within a census block.
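The swapping idea can be illustrated with a short Python sketch. This is not the bureau's actual procedure: the record layout, the `swap_incomes` function and the sample figures beyond the two people mentioned above are assumptions made for illustration. It simply reshuffles incomes among people in the same census block, so each block's income total stays the same while individual rows no longer pair the true age with the true income.

```python
import random

def swap_incomes(records, seed=0):
    """Shuffle income values among records that share a census block.

    Hypothetical helper for illustration; ages and blocks are left untouched.
    """
    rng = random.Random(seed)
    # Group row positions by census block.
    by_block = {}
    for i, rec in enumerate(records):
        by_block.setdefault(rec["block"], []).append(i)
    # Work on copies so the original file is not modified.
    masked = [dict(rec) for rec in records]
    for indices in by_block.values():
        incomes = [records[i]["income"] for i in indices]
        rng.shuffle(incomes)  # rearrange incomes within this block only
        for i, income in zip(indices, incomes):
            masked[i]["income"] = income
    return masked

# Made-up sample file: six people in one block (only the first two rows
# echo the article's example).
records = [
    {"block": "A", "age": 21, "income": 20000},
    {"block": "A", "age": 50, "income": 15000},
    {"block": "A", "age": 34, "income": 42000},
    {"block": "A", "age": 67, "income": 18000},
    {"block": "A", "age": 45, "income": 61000},
    {"block": "A", "age": 29, "income": 27000},
]
masked = swap_incomes(records)
block_total_before = sum(r["income"] for r in records)
block_total_after = sum(r["income"] for r in masked)
```

Because the incomes are only rearranged, `block_total_before` and `block_total_after` are equal, which is the property the article says keeps the file useful to researchers.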

Since 1990, government statisticians also have added distortion techniques known as “random noise” and “coarsening” to further confuse things. These involve slightly altering a number, such as income level, upward or downward and offsetting it by moving another number in the opposite direction. The trick is to blur the information without making it invalid for the kind of analysis for which the census is designed. Yet in some cases, “users have found this extremely irritating and unacceptable,” one Census Bureau researcher noted in a recent paper.
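A rough sketch of the offsetting idea, again in Python and again an assumption rather than the bureau's method: values are paired at random, one member of each pair is nudged up or down by a small random amount, and the other is moved by the same amount in the opposite direction. The function name `add_offsetting_noise`, the `max_shift` limit and the sample incomes are all hypothetical.

```python
import random

def add_offsetting_noise(values, max_shift=500, seed=0):
    """Perturb values in random pairs so that the overall total is unchanged.

    Hypothetical helper for illustration. With an odd number of values,
    the one left without a partner is not altered.
    """
    rng = random.Random(seed)
    noisy = list(values)
    order = list(range(len(noisy)))
    rng.shuffle(order)  # pick partners at random
    for a, b in zip(order[0::2], order[1::2]):
        shift = rng.randint(-max_shift, max_shift)
        noisy[a] += shift  # move one value by the shift...
        noisy[b] -= shift  # ...and its partner by the opposite amount
    return noisy

incomes = [20000, 15000, 42000, 18000, 61000, 27000]  # made-up figures
noisy_incomes = add_offsetting_noise(incomes)
```

Each figure moves by at most `max_shift` dollars and the sum of `noisy_incomes` matches the sum of `incomes`, which blurs individual rows without changing the aggregate.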

The bureau also is expected to make a major change in the amount of data it releases this year in its searchable “public microdata files,” which provide customized access to some of its most useful and intriguing data. In the 1990 Census, the bureau catalogued in fine detail the backgrounds and economic status of groups as small as 100,000 people for these files, including, for instance, the precise time it takes each person to get to work. But to protect privacy, these detailed profiles now will be applied to groups of about 400,000 people.

The bureau still will release information on groups of 100,000, but that will be much more general. For example, the number of languages spoken disclosed for this size group will be partially collapsed, to 74 from 305, while the categories of ethnic origin will go to 143 from 312.

The agency’s data-swapping practice hasn’t aroused much controversy among the many users of Census data — federal and local agencies, businesses and academic researchers — since it still allows for detailed analysis. But enlarging the size of the microdata groups is another matter. The users who are most concerned are academics, who like to examine population changes over multiple decades and thus seek highly consistent data.

Mr. Winkler, whose 1998 paper was commissioned by the bureau to test its security, and other statisticians believe that masked data can be at least partially deconstructed by matching it against demographic data now easily accessible on the Internet, such as estimated income levels and home values.
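The matching the statisticians describe can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Mr. Winkler's method: real record-linkage software uses probabilistic scoring, while this sketch only joins on exact values. The field names, the `link_records` function and all of the data are invented.

```python
from collections import defaultdict

def link_records(anonymous, named, keys):
    """Return {row index: name} for anonymous rows that match exactly one
    named record on the given fields. Hypothetical helper for illustration."""
    index = defaultdict(list)
    for person in named:
        index[tuple(person[k] for k in keys)].append(person["name"])
    matches = {}
    for i, rec in enumerate(anonymous):
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:  # only an unambiguous match counts
            matches[i] = candidates[0]
    return matches

# Invented outside data of the kind an intruder might assemble.
named = [
    {"name": "Resident 1", "area": "North", "age_band": "30-39", "home_value_band": "high"},
    {"name": "Resident 2", "area": "North", "age_band": "40-49", "home_value_band": "mid"},
    {"name": "Resident 3", "area": "North", "age_band": "40-49", "home_value_band": "mid"},
]
# Invented anonymous file with one extra, sensitive-looking column.
anonymous = [
    {"area": "North", "age_band": "30-39", "home_value_band": "high", "commute_minutes": 55},
    {"area": "North", "age_band": "40-49", "home_value_band": "mid", "commute_minutes": 20},
    {"area": "South", "age_band": "20-29", "home_value_band": "low", "commute_minutes": 35},
]
matches = link_records(anonymous, named, ["area", "age_band", "home_value_band"])
```

In this toy data only the first anonymous row is re-identified, because its combination of fields is unique in the outside file. The second row has two possible owners and the third has none, which hints at why swapping, noise and coarser categories reduce the yield of such matching.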

“There is an increased concern because of the amount of data that may now be publicly available on the Web that perhaps wasn’t there years ago,” says Laura Zayatz, who heads a four-person division within the Census Bureau called the Disclosure Limitation Group. “In response to that, we are ending up with less detail in our public microdata files than was there 10 years ago.” But even with the new precautions, Ms. Zayatz says she has no way to be certain this year’s census data isn’t vulnerable to re-identification.

“At least they’re being honest,” says Latanya Sweeney, a data privacy expert at Carnegie Mellon University, who has raised alarms about the vast amount of personal information on the Internet, and the potential for government data to be matched with it. “Those little fragments combine to make each of us unique.”

The concern about re-identification is becoming widespread. Germany has outlawed such techniques at the behest of the state statistics agency. The U.S. Department of Health and Human Services is developing regulations that prohibit the release of any health data from its files that could help marketers, insurers or others identify individuals.

So far, however, the concerns remain mostly untested. The only people known to have succeeded in census data re-identification are the bureau’s statisticians and privacy advocates. Gerald Gates, who handles policy for the Census Bureau, says he knows of no commercial researchers who are trying to decode such data. But, he says, “Given enough money and enough time, you can do a lot of things.”

The re-identification process is highly complex and doesn’t have a high yield: In a Census Bureau test, only 10 percent of survey participants could be re-identified. But that is enough for the bureau to be concerned.

“I think many people feel they could probably obtain information easier from some other source than trying to obtain it from a census file,” says the bureau’s Ms. Zayatz, “but we’re still very protective.”

Author: Glenn R. Simpson

News Service: msnbc

URL: http://www.msnbc.com/news/530646.asp
