StATS: A clumsy attempt at anonymization (August 15, 2006)
Statisticians frequently deal with confidentiality issues when deciding what
type of data and what amount of detail should be withheld to protect sensitive
information about individual patients or institutions. It's not an easy task
and there are some subtle traps. And sometimes there are not-so-subtle traps.
At the request of some researchers, America Online (AOL) released data on 20
million web searches performed by 650 thousand AOL users over a three-month span.
They released the data, not just to those researchers, but to the general
public. AOL quickly realized that this was a bad idea and removed the database,
but it had already been copied to many locations. It is unlikely that they will
ever be able to persuade the web owners at all the other locations to take the
data down.
The data was anonymized by replacing the user name with a random number. This
is important, because some of the search terms are for rather sensitive items.
Examples of things that people searched on are
- "can you adopt after a suicide attempt" or
- "how to tell your family you're a victim of incest."
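As a rough illustration of the replacement described above, here is a minimal Python sketch. It is not AOL's actual procedure, which is known here only as "replacing the user name with a random number." The function names, the user names `user_a` and `user_b`, the range of the random numbers, and the assignment of queries to users are all hypothetical; the query strings are ones quoted in this article.

```python
import secrets
from collections import defaultdict


def pseudonymize(log):
    """Replace each user name with a random number, the same number every time."""
    ids = {}      # user name -> random number (this mapping is never released)
    used = set()  # numbers already handed out, to avoid collisions
    released = []
    for user, query in log:
        if user not in ids:
            new_id = secrets.randbelow(10**7)
            while new_id in used:
                new_id = secrets.randbelow(10**7)
            used.add(new_id)
            ids[user] = new_id
        released.append((ids[user], query))
    return released


def group_by_id(records):
    """Collect all queries that share the same random number."""
    grouped = defaultdict(list)
    for uid, query in records:
        grouped[uid].append(query)
    return dict(grouped)


# Hypothetical log: the user names and the pairing of queries to users are invented.
log = [
    ("user_a", "cheap rims for a ford focus"),
    ("user_b", "knitting stitches"),
    ("user_a", "l&n federal credit union"),
]

released = pseudonymize(log)
profiles = group_by_id(released)
for uid, queries in profiles.items():
    print(uid, queries)
```

The names are gone from `released`, but every query from the same person still carries the same number, so the searches stay linked to one another.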
But replacing a name by a number did not come even close to anonymizing all
of the records. The problem is that people will do web searches about things
that reveal hints about themselves. Actual searches listed in the database
included things like geographic locations:
- "gynecology oncologists in new york city,"
- "orange county california jails inmate information,"
- "employment needed- louisville ky," or
- "salem probate court decisions,"
or places where the searchers shopped or banked or got health care,
- "gerards restaurant in dc,"
- "st. margaret's hospital washington d.c.,"
- "l&n federal credit union," or
- "mustang sally gentlemans club,"
or products that the searchers owned,
- "cheap rims for a ford focus," or
- "how to change brake pads on scion xb,"
or their hobbies,
- "knitting stitches," or
- "texas hold'em poker on line seminars."
It gets even more revealing when people do web searches on their relatives or
These individual searches are, according to one report, like individual
pieces in a mosaic. Put enough of them together and you can get a really clear
picture of who the searcher is. Can you actually identify people from their web
searches? The answer is yes.
- A Face is Exposed for AOL Searcher No. 4417749. Michael Barbaro and
Tom Zeller, Jr. The New York Times (August 9, 2006). www.nytimes.com/2006/08/09/technology/09aol.html
[Available only for a few days more for free.]
According to the article, user number 4417749 searched for
- "landscapers in Lilburn, Ga," and
- "homes sold in shadow lake subdivision gwinnett county georgia,"
as well as the names of several people, all of whose last names were Arnold.
It didn't take long for the New York Times to track down a 62-year-old widow
named Thelma Arnold.
Ms. Arnold, who agreed to discuss her searches with a reporter, said she
was shocked to hear that AOL had saved and published three months’ worth of
them. “My goodness, it’s my whole personal life,” she said. “I had no idea
somebody was looking over my shoulder.”
This is an important lesson that statisticians have been aware of for some
time. An individual piece of information by itself may not compromise someone's
privacy, but it may do so when it is combined with other pieces of information.
Knowing that someone lives in a small town still preserves anonymity, but when
that small town name appears in a database of all pediatric heart transplant
cases, you have a problem.
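One way to look for this kind of problem before releasing data is to count how many records share each combination of potentially identifying values and flag the combinations that are rare. This is in the spirit of what is often called k-anonymity. The sketch below is only an illustration: the function name `risky_combinations`, the threshold `k=5`, the field names, and all of the records are made up.

```python
from collections import Counter


def risky_combinations(records, quasi_identifiers, k=5):
    """Return value combinations shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Entirely made-up records; towns and counts are hypothetical.
records = (
    [{"town": "Big City", "procedure": "tonsillectomy"}] * 40
    + [{"town": "Small Town", "procedure": "tonsillectomy"}] * 6
    + [{"town": "Big City", "procedure": "heart transplant"}] * 5
    + [{"town": "Small Town", "procedure": "heart transplant"}] * 1
)

# Each field on its own looks safe at k=5 ...
by_town = risky_combinations(records, ["town"])
by_procedure = risky_combinations(records, ["procedure"])

# ... but the combination isolates a single record.
by_both = risky_combinations(records, ["town", "procedure"])

print(by_town)
print(by_procedure)
print(by_both)
```

In this toy data, neither the town nor the procedure is rare by itself, but only one record has both the small town and the heart transplant. The choice of threshold is a judgment call, and a check like this only covers the fields you think to include.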
I posted this article on the Wiki as well.
This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website.