
Which type of threshold should we use for de-identification?

Author: Khaled El Emam Views: 1167 Created: 12-10-2009 19:00 Last Updated: 07-03-2010 08:11


Many types of thresholds have been suggested and used for deciding when a data set is de-identified. Some common ones are:

  • Cell size of 5, 3, or 10
  • Uniqueness
  • Rareness

 

A question that comes up in practice is: "Which one should we use?"

 

In fact, all three of these are related. The general rule is:

 

 X% of the records are in cell sizes >= k (or equivalence classes of size k)

A common instantiation, called 5-anonymity, is:

 

 100% of the records are in cell sizes >= 5

 

This means that every combination of quasi-identifier values that appears in the data set occurs at least five times.

 

The uniqueness criterion can be stated as 2-anonymity:

  

  100% of the records are in cell sizes >= 2

 

There are, however, cases where 95% and 80% are acceptable values for X.

 

For example, some cancer registries release their data to researchers if fewer than 20% of the records are unique (X = 80%), and to the public if fewer than 5% of the records are unique (X = 95%).
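The general rule above can be checked mechanically. The following is a minimal sketch in Python, not a reference implementation: the function names, the field names (birth_year, sex) and the sample records are hypothetical, and it assumes each record is a dictionary keyed by quasi-identifier name.

```python
from collections import Counter


def count_records_in_cells_of_at_least(records, quasi_identifiers, k):
    """Count records whose equivalence class (cell) has at least k records."""
    # An equivalence class is the set of records sharing the same
    # combination of values on the quasi-identifiers.
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    cell_sizes = Counter(keys)
    return sum(1 for key in keys if cell_sizes[key] >= k)


def meets_threshold(records, quasi_identifiers, k, x_percent=100):
    """True if at least x_percent of the records are in cells of size >= k."""
    if not records:
        raise ValueError("records must not be empty")
    in_large_cells = count_records_in_cells_of_at_least(
        records, quasi_identifiers, k
    )
    # Compare products instead of dividing, to avoid rounding surprises.
    return in_large_cells * 100 >= x_percent * len(records)


# Hypothetical data: four records share one cell, one record is unique.
records = [
    {"birth_year": 1970, "sex": "M"},
    {"birth_year": 1970, "sex": "M"},
    {"birth_year": 1970, "sex": "M"},
    {"birth_year": 1970, "sex": "M"},
    {"birth_year": 1980, "sex": "F"},
]
qis = ["birth_year", "sex"]

print(meets_threshold(records, qis, k=2, x_percent=80))  # relaxed uniqueness rule
print(meets_threshold(records, qis, k=2))                # strict 2-anonymity
print(meets_threshold(records, qis, k=5))                # strict 5-anonymity
```

In this sample, 80% of the records are non-unique, so the relaxed uniqueness rule (k = 2, X = 80%) is satisfied, while strict 2-anonymity and 5-anonymity are not.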

 

The third criterion, rareness, means one has to ensure that there are no rare records. The general rule here is:

 

 all equivalence classes have >X% of the records in the population

 

This rule ensures that there are no equivalence classes that are relatively rare. Rareness is often defined in terms of the population, not in terms of the records in the data set.

 

For example, some national statistical agencies will not disclose census information if any equivalence class covers 0.5% of the population or less. This is the rule used to justify not releasing individual ages above 89 years: very few people live beyond that age (i.e., each individual age of 90 or above accounts for fewer than 0.5% of the population).
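The rareness rule can be sketched in the same way. This example assumes that population counts per equivalence class are available from an external source such as a census; the function name, the age bands and the counts are made up purely for illustration.

```python
def rare_classes(population_counts, x_percent):
    """Return the equivalence classes covering x_percent of the population or less."""
    total = sum(population_counts.values())
    # A class is rare when count / total <= x_percent / 100.
    # Cross-multiplying keeps the comparison free of division.
    return sorted(
        key
        for key, count in population_counts.items()
        if count * 100 <= x_percent * total
    )


# Hypothetical population counts per age band (not real census figures).
population_counts = {"0-29": 400, "30-59": 400, "60-89": 196, "90+": 4}

# With a 0.5% threshold, only the "90+" band (0.4% of the total) is flagged.
print(rare_classes(population_counts, x_percent=0.5))
```

A data set would pass the rareness rule only if this list is empty; flagged classes would typically be merged into broader categories before release, as with top-coding ages above 89.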

 

The question is which of the above rules we should use, and what the values should be. There are no hard rules on this, but a reasonable approach is to use precedent.

 

The argument for using precedent is that it signifies acceptability. If a particular rule has a lot of precedent then it suggests that society has accepted the level of risk implied by the rule. For example, there is a lot of precedent spanning multiple decades for the cell size of five rule, so it is safe to assume that this is a generally accepted level of risk.

 

Precedent may be specific to a certain type of data or registry. For example, a precedent that is acceptable for the disclosure of cancer registry data may not be acceptable for sexually transmitted disease or mental health data. The appropriate precedent also depends on who the data is being disclosed to.



The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.

