Lowering the cost of anonymization

a PhD thesis

This is the HTML version of my PhD thesis, titled Lowering the cost of anonymization. I defended it in December 2020, and it was accepted by my jury, composed of:

  • Prof. David Basin, examiner and supervisor,
  • Prof. Dennis Hofheinz, co-examiner,
  • Prof. Carmela Troncoso, co-examiner,
  • and Prof. Ashwin Machanavajjhala, co-examiner.

There are minor differences with the official PDF version: some text and formatting were changed while converting the original .tex files to HTML, and some errors detected by careful readers were fixed. The PDF version remains the canonical one, however.

If you would like to cite this thesis, you can use the following BibTeX entry.

  @phdthesis{desfontaines2020,
    title  = {Lowering the cost of anonymization},
    author = {Desfontaines, Damien},
    year   = {2020},
    school = {ETH Zurich}
  }

Note, however, that the vast majority of this dissertation has also been published as scientific papers co-authored with other people; please refer to the personal contributions section for more details, and cite accordingly. You can also reuse and/or reshare this work; it is licensed under the CC BY-NC 4.0 license.

Finally, please note that all opinions expressed in this dissertation are mine, and do not represent the views of my current or past employers.

Now that the disclaimers are out of the way, feel free to start with the abstract (in three languages) or the introduction, or jump directly to any section using the table of contents below. Some sections are marked with a little flower: those are the parts with little or no math, so you can choose to read only those if you prefer to keep things high-level.

  Personal contributions
1  Introduction
2  Defining anonymization
 2.1  From syntactic to semantic privacy
  2.1.1  k-anonymity
  2.1.2  k-map
  2.1.3  ℓ-diversity
  2.1.4  δ-presence
  2.1.5  Flaws of syntactic privacy definitions
  2.1.6  Differential privacy
 2.2  Systematizing variants & extensions of differential privacy
  2.2.1  Preliminaries
  2.2.2  Quantification of privacy loss (Q)
  2.2.3  Neighborhood definition (N)
  2.2.4  Variation of privacy loss (V)
  2.2.5  Background knowledge (B)
  2.2.6  Change in formalism (F)
  2.2.7  Relativization of the knowledge gain (R)
  2.2.8  Computational power (C)
  2.2.9  Summarizing table
  2.2.10  Related work
 2.3  Conclusion
3  Differential privacy under partial knowledge
 3.1  Definitional intricacies of partial knowledge
  3.1.1  Related work
  3.1.2  Correlated data
  3.1.3  Passive vs. active attackers
 3.2  Opportunities: aggregations & thresholding
  3.2.1  Counting queries
  3.2.2  Thresholding
  3.2.3  Application to k-anonymity
  3.2.4  Composition
 3.3  Limits: cardinality estimation
  3.3.1  Preliminaries
  3.3.2  Private cardinality estimators are imprecise
  3.3.3  Weakening the privacy definition
  3.3.4  Mitigation strategies
 3.4  Conclusion
4  From theory to practice
 4.1  Analyzing privacy risks at scale
  4.1.1  Introduction
  4.1.2  Definitions and system design
  4.1.3  Estimating reidentifiability and joinability using KHLL
  4.1.4  Experimental validation
  4.1.5  Summary and perspectives
 4.2  Building a usable differentially private query engine
  4.2.1  Introduction
  4.2.2  A simple example: histograms
  4.2.3  System model and design
  4.2.4  Accuracy
  4.2.5  Usability
  4.2.6  Robustness and testing
 4.3  Further improvements and open questions
  4.3.1  Beyond counts and sums
  4.3.2  Partition selection
  4.3.3  Improving coverage and utility
  4.3.4  Operational aspects of anonymization
  4.3.5  Other possible research directions
5  Conclusion


I'm always glad to get feedback! If you'd like to contact me, please do so via e-mail (se.niatnofsed@neimad) or Twitter (@TedOnPrivacy).