One question that sometimes comes up is whether a data set should be considered identifiable if a person can find their own record(s) in it. This definition can be analyzed from a number of perspectives.
A person may not know whether they are in a data set if the data set is a sample. For example, if a data set is based on chart reviews from a random subset of patients at a clinic, then a randomly selected patient at that clinic will not necessarily know that they are in the data set created from the chart review. This uncertainty means that the above definition of identifiability may not be appropriate: a patient who finds a record that matches their own characteristics will not know whether that record really belongs to them or to someone else.
As a caveat, if the clinic is small then the chances of another patient having exactly the same characteristics would also be small. Also, if the number of variables extracted from the charts is large, then it is less likely that there would be another patient at the clinic who is similar on all of the variables.
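The uncertainty described above can be made concrete with a rough sketch. It rests on a simplifying assumption that is not part of the original argument: every patient at the clinic who shares the same characteristics is equally likely to have been sampled, so a single matching record belongs to any particular one of them with probability 1/F, where F is the number of such patients at the clinic. The clinic records, the variables, and the function name are made up for illustration.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical clinic population: (year of birth, sex, diagnosis) per patient.
clinic = [
    (1970, "F", "asthma"),
    (1970, "F", "asthma"),
    (1970, "F", "asthma"),
    (1970, "F", "asthma"),
    (1982, "M", "diabetes"),
]

def chance_match_is_mine(my_values, population):
    """Probability that a single matching sample record is my own,
    assuming everyone who shares my values was equally likely to be sampled."""
    class_size = Counter(population)[my_values]
    if class_size == 0:
        raise ValueError("no one in the population has these values")
    return Fraction(1, class_size)

print(chance_match_is_mine((1970, "F", "asthma"), clinic))    # 1/4
print(chance_match_is_mine((1982, "M", "diabetes"), clinic))  # 1
```

The second patient is the only one at the clinic with those values, so a match is certainly their own record. This is the situation the caveat describes.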
Another scenario is when a patient does know that they are in a data set. There are a couple of ways that a patient can know that their record is in that data set:
- If the patient knows that they are unique in the population and they find a match in the chart review sample, then they can be confident that they have found their own record. This assumes that the patient knows that they are unique in the population, and there are some circumstances where that knowledge is reasonable. For example, Canadians living in urban areas are almost always unique on their date of birth and residence postal code. Therefore, a patient who finds a match on these two variables in the chart review sample can be almost certain that the record is their own.
- If the data set is not a sample but covers a whole population, as in a population registry (for example, a provincial cancer registry), then the patient knows for sure that they are in the data set. If the patient finds a single record that matches, that is, if the patient is unique in the data set, then they will know for sure that they have found their own record.
Let us assume that one of the above two conditions is true. In that case, is the fact that the patient can find their own record a workable definition of an identifiable data set?
If we accept the above definition then we are setting a high standard. A common way to model an intruder is to consider the kind of background knowledge the intruder would have about the target person (or persons) being re-identified. The more background information the intruder has, the greater the re-identification risk. When the intruder is also the target person, the intruder has the maximum possible background information, much more than any other intruder would know. It is true that many people tell their friends and family many things, but they do not tell them absolutely everything. Therefore, the background knowledge of a person about themselves represents the maximum possible background information and therefore the maximum possible risk. If one wants to be conservative, then this is a good approach. But in many cases it does not seem very plausible to assume that an intruder will know absolutely everything. In fact, the standard would be so high that we would not be able to share any information at all unless:
- the data set disclosed is a random sample so that an individual would not know if their record is within the data set (i.e., almost by definition, no population registry could be considered de-identified),
- the sample data set does not include many variables so that there would be other individuals with the same characteristics in the population (e.g., the clinic example mentioned above), and
- the underlying population is large enough that the chances of an individual being unique are quite small.
A counterargument is that people are now voluntarily (and involuntarily, through their friends and colleagues) revealing more and more about themselves on their blogs, Facebook pages, and Tweets. This is certainly the case, and more is being revealed every day. Whether this type of self-exposure amounts to individuals revealing everything about themselves, such that an intruder has the same background knowledge as the person themselves, remains an empirical question, although it is easy to argue that we have not quite reached that point yet.
Another scenario to consider is when the following conditions are met:
- the data set has some quasi-identifiers and some sensitive information (an example of the quasi-identifiers would be the demographics),
- there are only two individuals in the data set that have exactly the same values on the quasi-identifiers,
- one of those individuals, say Bob, gets the data set, and
- Bob knows the second person who has the same characteristics, Joe.
Under these conditions, Bob would discover the sensitive information about Joe with certainty: Bob can tell which of the two matching records is his own, so the other one must be Joe's. Therefore, re-identifying one's own record results in the disclosure of sensitive information about another individual.
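The scenario with Bob and Joe can be illustrated with a small sketch. The records, field names, and the `infer_other` helper are hypothetical and chosen only for illustration; the sketch assumes Bob knows his own sensitive value and can therefore set his own record aside.

```python
# Hypothetical released data: quasi-identifiers plus one sensitive field.
released = [
    {"age": 45, "sex": "M", "postal": "K1H", "diagnosis": "hepatitis C"},
    {"age": 45, "sex": "M", "postal": "K1H", "diagnosis": "hypertension"},
    {"age": 31, "sex": "F", "postal": "K2P", "diagnosis": "asthma"},
]
QUASI_IDENTIFIERS = ("age", "sex", "postal")

def infer_other(records, my_qi, my_sensitive, sensitive_field):
    """Return the other person's sensitive value if exactly two records
    share my quasi-identifiers and one of them is mine; otherwise None."""
    matches = [r for r in records
               if tuple(r[q] for q in QUASI_IDENTIFIERS) == my_qi]
    if len(matches) != 2:
        return None
    values = [r[sensitive_field] for r in matches]
    if my_sensitive not in values:
        return None
    values.remove(my_sensitive)  # set aside my own record
    return values[0]

# Bob knows his own diagnosis, so the remaining record must be Joe's.
print(infer_other(released, (45, "M", "K1H"), "hypertension", "diagnosis"))
# hepatitis C
```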
The approach we have taken is to define plausible intruders (or archetypes of intruders) and assess what type of background knowledge they would have. The three we consider are a neighbor, an ex-spouse, and a reporter. We selected these three because all of the re-identifications that have actually happened and that have been publicly acknowledged have been done by researchers, reporters, or in court cases. All three acknowledged types of intruders can use publicly available information (e.g., in public registries), and all three can talk to neighbors or ex-spouses to get additional information. Therefore, by focusing on these three types of intruders we are addressing plausible risks that we know have happened.
Furthermore, we ensure that there are always more than two records with the same values on the quasi-identifiers. That way the re-identification of one's own record does not facilitate the discovery of new information about someone else.
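A check of this kind could look roughly like the following sketch. It is not the tooling used by the authors; the `small_classes` function, the field names, and the sample records are assumptions made for illustration. It groups records by their quasi-identifier values and reports any group with fewer than three records.

```python
from collections import Counter

def small_classes(records, quasi_identifiers, min_size=3):
    """Return quasi-identifier combinations shared by fewer than min_size records."""
    sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {values: n for values, n in sizes.items() if n < min_size}

# Hypothetical data set with two quasi-identifiers.
data = [
    {"age": 45, "sex": "M"},
    {"age": 45, "sex": "M"},
    {"age": 45, "sex": "M"},
    {"age": 31, "sex": "F"},
    {"age": 31, "sex": "F"},
]
print(small_classes(data, ("age", "sex")))  # {(31, 'F'): 2}
```

An empty result would mean every combination of quasi-identifier values is shared by more than two records.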
The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.