Under this scenario, a data custodian is providing data to two researchers, A and B. Neither data set contains directly identifying variables, but the two data sets have some overlapping variables; for example, they may both have the patients' date of birth and postal code. We will call these overlapping variables the quasi-identifiers. The custodian is concerned that the two researchers may collude and try to link the two data sets together, which the custodian wants to prevent because of privacy or other legal concerns. How can the custodian assess that risk?
In a recent paper (to appear in PAIS 2010 and also attached to this article) we have developed some metrics that can be used to evaluate the proportion of records that can be re-identified if the two researchers try to link the data sets on the quasi-identifiers.
If the two data sets have N records and they have the same patients in them, then the proportion of records that can be correctly matched if the two researchers try to link their data sets is J/N. Here, J is the number of distinct combinations of values on the quasi-identifiers (called equivalence classes). For example, {1/1/1980, K1H 8L1} would be one of the equivalence classes.
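To make the J/N metric concrete, here is a minimal Python sketch. It assumes each record is a dictionary keyed by variable name; the function name `match_proportion`, the field names `dob` and `postal_code`, and the sample records are all hypothetical and made up for illustration, not taken from the paper.

```python
from collections import Counter


def match_proportion(records, quasi_identifiers):
    """Return J/N for a data set.

    J is the number of equivalence classes (distinct combinations of
    values on the quasi-identifiers) and N is the number of records.
    Assumes both data sets contain the same N patients.
    """
    if not records:
        raise ValueError("records must not be empty")
    # Count how many records fall into each equivalence class.
    classes = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return len(classes) / len(records)


# Illustrative (made-up) records: 5 records, 3 equivalence classes.
records = [
    {"dob": "1/1/1980", "postal_code": "K1H 8L1"},
    {"dob": "1/1/1980", "postal_code": "K1H 8L1"},
    {"dob": "2/3/1975", "postal_code": "K1H 8L1"},
    {"dob": "5/6/1990", "postal_code": "K2P 1L4"},
    {"dob": "5/6/1990", "postal_code": "K2P 1L4"},
]

proportion = match_proportion(records, ["dob", "postal_code"])
print(proportion)  # J/N = 3/5
```

In this toy example, three distinct (date of birth, postal code) combinations appear among five records, so the metric is 3/5.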
If one of the data sets is a sample of the other, with the smaller data set having n records and n<N, then the equation is a bit more complicated; it is described more fully in the paper (equation 1). It gives the proportion of the records in the smaller data set that will be correctly matched if the two researchers try to link their data sets.
Note that the two researchers will not know which records were successfully matched. For example, if the proportion produced by either of these two metrics is, say, 0.4, then it is not known which 40% of the records are correctly matched. Therefore, for the matching exercise to be worthwhile for the researchers, they need to have confidence that the match rate is high. It can be argued that a match rate below 80% (i.e., fewer than 80% of the records correctly matched) carries enough risk of incorrect links that the exercise would not be worth it for a researcher.
A data custodian can then run cross-checks on their data releases to see if the risk arising from such potential collusion is acceptable.
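One way such a cross-check might look is sketched below in Python, under several assumptions: both releases cover the same patients (so the J/N metric applies), the quasi-identifiers are taken to be the variable names the two releases share, and the 80% figure argued above is used as a configurable threshold. The function names, column names, and records are hypothetical and made up for illustration.

```python
def shared_quasi_identifiers(columns_a, columns_b):
    """Variables present in both releases (assumed to be the quasi-identifiers)."""
    return sorted(set(columns_a) & set(columns_b))


def check_collusion_risk(records, columns_a, columns_b, threshold=0.8):
    """Return (proportion, acceptable) for a pair of releases.

    proportion is J/N computed on the shared variables, assuming both
    releases contain the same N patients. acceptable is True when the
    proportion falls below the threshold (an assumed cut-off of 80%).
    """
    if not records:
        raise ValueError("records must not be empty")
    shared = shared_quasi_identifiers(columns_a, columns_b)
    # Each distinct combination of shared values is one equivalence class.
    classes = {tuple(record[q] for q in shared) for record in records}
    proportion = len(classes) / len(records)
    return proportion, proportion < threshold


# Illustrative (made-up) source records held by the custodian.
patients = [
    {"dob": "1/1/1980", "postal_code": "K1H 8L1", "diagnosis": "A", "drug": "X"},
    {"dob": "1/1/1980", "postal_code": "K1H 8L1", "diagnosis": "B", "drug": "Y"},
    {"dob": "5/6/1990", "postal_code": "K2P 1L4", "diagnosis": "A", "drug": "Y"},
    {"dob": "5/6/1990", "postal_code": "K2P 1L4", "diagnosis": "C", "drug": "X"},
]
release_a = ["dob", "postal_code", "diagnosis"]  # variables sent to researcher A
release_b = ["dob", "postal_code", "drug"]       # variables sent to researcher B

proportion, acceptable = check_collusion_risk(patients, release_a, release_b)
print(proportion, acceptable)
```

Here the two releases share date of birth and postal code, which form two equivalence classes over four records, so the estimated match proportion is 2/4 and falls below the assumed threshold. The sampled case (n<N) would need equation 1 from the paper instead and is not covered by this sketch.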
The above analysis also applies if two different data sets have been disclosed to the same researcher at two different points in time.
The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.