If you are trying to manage prosecutor risk, then you assume that the intruder has a specific target person in mind and is trying to re-identify that person's records in the disclosed data set. The intruder is also able to get some background information about that target person. This type of background information represents the quasi-identifiers you are interested in. The type of background information we assume depends on the intruder. The types of intruder we usually consider when managing prosecutor risk are: (a) a neighbor, (b) an ex-spouse, (c) an employer or colleague at work, (d) a relative, and (e) a stalker.
In a worst-case scenario, a neighbor would know:
- Address and telephone information about the target individual
- Household and dwelling information (number of children, value of property, type of property)
- Key dates (births, deaths, weddings, admissions, discharges)
- Visible characteristics: gender, race, ethnicity, language spoken at home, weight, height, physical disabilities
- Profession
Of course, not all neighbors are friendly or nosy, and therefore, a particular neighbor may not know all of the above things. But these are plausible things that a neighbor would know by observing and casually interacting with the target individual and their family.
An ex-spouse would know:
- The same things that a neighbor would know
- Basic medical history (allergies, chronic diseases)
- Income, years of schooling
An employer, a colleague at work, or a relative would generally know less than a neighbor or an ex-spouse.
A stalker could be after a famous person, or after an estranged spouse, boyfriend, or girlfriend. In the former case, the quasi-identifiers that a stalker would know are whatever information is publicly available about the famous person. In the latter case, they are the same as those an ex-spouse would know. We generally assume that an ex-spouse would have the most background information.
If any of the above variables exist in the disclosed data set, then you should take them into account in the re-identification risk analysis.
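As a rough illustration of this step, the sketch below checks which of the variables listed above appear in a data set and then estimates a per-record risk from them. The column names, the sample records, and both helper functions are hypothetical and chosen only for this example. The risk measure used here (one divided by the number of records sharing the same quasi-identifier values) is a common way to express prosecutor risk, but it is an assumption of this sketch and is not defined in this article.

```python
from collections import Counter

# Hypothetical column names for what a neighbor is assumed to know.
NEIGHBOR = {
    "address", "phone", "num_children", "property_value", "property_type",
    "birth_date", "admission_date", "discharge_date", "gender", "race",
    "ethnicity", "home_language", "weight", "height",
    "physical_disability", "profession",
}

# An ex-spouse is assumed to know everything a neighbor knows, plus more.
EX_SPOUSE = NEIGHBOR | {
    "allergies", "chronic_diseases", "income", "years_of_schooling",
}


def quasi_identifiers_present(columns, profile=EX_SPOUSE):
    """Return the data set columns that the assumed intruder would know."""
    return [c for c in columns if c in profile]


def prosecutor_risk(records, quasi_identifiers):
    """Per-record risk: 1 / number of records with the same quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1 / class_sizes[k] for k in keys]


# Made-up records for illustration only.
records = [
    {"gender": "F", "profession": "nurse", "income": "50-60k", "lab_result": 4.1},
    {"gender": "F", "profession": "nurse", "income": "50-60k", "lab_result": 5.0},
    {"gender": "M", "profession": "teacher", "income": "40-50k", "lab_result": 4.7},
]

qis = quasi_identifiers_present(records[0].keys())
risks = prosecutor_risk(records, qis)
print(qis)
print(risks, max(risks))
```

Defaulting to the ex-spouse profile reflects the assumption that this intruder has the most background information; passing `profile=NEIGHBOR` would model a less informed intruder.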
But we also have to be pragmatic. For example, there are no easy ways to de-identify diagnosis codes. Therefore, if they exist in the disclosed data set and represent a high re-identification risk, then that risk may be better mitigated using a data sharing agreement and audits (see our risk assessment methodology) rather than through de-identification.
The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.