Privacy Preserving Data Mining Bibliography

 

What's New:
- I will update the bibliography after Xmas. (11/7/2007) new!
- Welcome to the privacy preserving data mining (PPDM) bibliography. The goal of this page is to maintain and distribute a bibliography of PPDM-related publications. I hope that PPDM researchers and practitioners find this service useful. I welcome every help from the community in maintaining the bibliography. (05/01/2006)
- Any questions or comments should be directed to Kun Liu <kunliu1 AT cs.umbc.edu>.
- The ads in this page provide me and my colleagues with coffee and cookie, essential components of a healthy research :)


Contents

 

Privacy Preserving Data Mining Philosophical Issues

• Mary J. Cronin, "e-Privacy?," HOOVER DIGEST, No. 3, 2000. Adapted from the essay "Privacy and Electronic Commerce," by Mary J. Cronin, in the new Hoover Press book Public Policy and the Internet: Privacy, Taxes, and Contract, edited by Nicholas Imparato. [link]

[abstract]

Imagine going to a shopping mall in which researchers follow you from store to store, taking notes on every product you examine or buy. Would you shop in such a place? Chances are, you already do. Welcome to the Internet.

[bib]

''this is the bibtex entry ...''

• Alan J. Broder, "Data Mining, the Internet, and Privacy," International WEBKDD’99 Workshop San Diego, CA, USA, August 15, 1999. [link]

[abstract]

This paper addresses the inherent technological conflict between the desire for privacy by World Wide Web users, and the need of Web content providers and advertisers to more fully collect and utilize data about users. As the other papers in this volume illustrate, the Data Mining community has now turned its collective attention towards the Web as a fertile venue for research and development. In doing so, Data Miners have found themselves at the nexus of this conflict.

We present the technical issues regarding privacy from two perspectives. First, from the perspective of the Web user who may be unaware of the degree to which identifying information can be inadvertently disclosed. And second, from the perspective of a Data Miner we consider the extent to which privacy enhancing technologies could substantially invalidate data mining results.

[bib]

@INCOLLECTION{Broder_99,
  AUTHOR =       "Alan J. Broder",
  TITLE =        "Data Mining, the Internet, and Privacy",
  BOOKTITLE =    "Web Usage Analysis and User Profiling: International WEBKDD'99 Workshop San Diego, CA, USA, August 15, 1999 ",
  PUBLISHER =    "Springer Berlin/Heidelberg",
  YEAR =         "1999",
  editor =       "",
  volume =       "1836/2000",
  number =       "",
  series =       "",
  type =         "",
  chapter =      "p.56",
  pages =        "",
  address =      "",
  edition =      "",
  month =        "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• EPIC: Electronic Privacy Information Center, a public interest research center in Washington, D.C. [link]

[abstract]

[bib]


Privacy Preserving Data Mining Survey/General Issues

• V.S. Verykios, E. Bertino, I.N. Fovino, L.P. Provenza, Y. Saygin, and Y. Theodoridis, “State-of-the-Art in Privacy Preserving Data Mining,” ACM SIGMOD Record, vol. 3, no. 1, pp. 50-57, Mar. 2004. [link]

[abstract]

We provide here an overview of the new and rapidly emerging research area of privacy preserving data mining. We also propose a classification hierarchy that sets the basis for analyzing the work which has been performed in this context. A detailed review of the work accomplished in this area is also given, along with the coordinates of each work to the classification hierarchy. A brief evaluation is performed, and some initial conclusions are made.

[bib]

@ARTICLE{Verykios_04v,
  AUTHOR =       "V. S. Verykios and E. Bertino and I. N. Fovino and L. P. Provenza and Y. Saygin and Y. Theodoridis",
  TITLE =        "State-of-the-art in Privacy Preserving Data Mining",
  JOURNAL =      "ACM SIGMOD Record",
  YEAR =         "2004",
  volume =       "3",
  number =       "1",
  pages =        "50--57",
  month =        "March",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu, “Tools for Privacy Preserving Distributed Data Mining,” ACM SIGKDD Explorations, vol. 4, no. 2, 2003. [link]

[abstract]

Privacy preserving mining of distributed data has numerous applications. Each application poses di erent constraints: What is meant by privacy, what are the desired results, how is the data distributed, what are the constraints on collaboration and cooperative computing, etc. We suggest that the solution to this is a toolkit of components that can be combined for speci c privacy-preserving data mining applications. This paper presents some components of such a toolkit, and shows how they can be used to solve several privacy-preserving data mining problems.

[bib]

@article{Clifton:03,
    author      = {C. Clifton and M. Kantarcioglu and J. Vaidya and X. Lin and M. Zhu},
    title       = "Tools for Privacy Preserving Distributed Data Mining",
    journal     = {ACM SIGKDD Explorations},
    volume      = {4},
    number      = {2},
    year        = {2003},
}

• B. Pinkas, “Cryptographic Techniques for Privacy Preserving Data Mining,” SIGKDD Explorations, vol. 4, no. 2, pp. 12-19, 2002. [link]

[abstract]

Research in secure distributed computation, which was done as part of a larger body of research in the theory of cryptography, has achieved remarkable results. It was shown that non-trusting parties can jointly compute functions of their different inputs while ensuring that no party learns anything but the defined output of the function. These results were shown using generic constructions that can be applied to any function that has an efficient representation as a circuit. We describe these results, discuss their efficiency, and demonstrate their relevance to privacy preserving computation of data mining algorithms. We also show examples of secure computation of data mining algorithms that use these generic constructions.

[bib]

@ARTICLE{Pinkas_02,
  AUTHOR =       "B. Pinkas",
  TITLE =        "Cryptographic Techniques for Privacy Preserving Data Mining",
  JOURNAL =      "SIGKDD Explorations",
  YEAR =         "2002",
  volume =       "4",
  number =       "2",
  pages =        "12--19 ",
  month =        "",
  note =         "",
  abstract =     "",
  keywords =     "",
  source =       "",
}

• Murat Kantarcio\&\#487;lu and Jiashun Jin and Chris Clifton, "When do data mining results violate privacy?," in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004. [link]

[abstract]

Privacy-preserving data mining has concentrated on obtaining valid results when the input data is private. An extreme example is Secure Multiparty Computation-based methods, where only the results are revealed. However, this still leaves a potential privacy breach: Do the results themselves violate privacy? This paper explores this issue, developing a framework under which this question can be addressed. Metrics are proposed, along with analysis that those metrics are consistent in the face of apparent problems.

[bib]

@inproceedings{1014126,
 author = {Murat Kantarcio\&\#487;lu and Jiashun Jin and Chris Clifton},
 title = {When do data mining results violate privacy?},
 booktitle = {KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining},
 year = {2004},
 isbn = {1-58113-888-1},
 pages = {599--604},
 location = {Seattle, WA, USA},
 doi = {http://doi.acm.org/10.1145/1014052.1014126},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }

• Shipra Agrawal, Vijay Krishnan, Jayant Haritsa, "On Addressing Efficiency Concerns in Privacy Preserving Data Mining," In Proceedings of the 9th International Conference on Database Systems for Advanced Applications (DASFAA-2004). [link]

[abstract]

Data mining services require accurate input data for their results to be meaningful, but privacy concerns may influence users to provide spurious information. To encourage users to provide correct inputs, we recently proposed a data distortion scheme for association rule mining that simultaneously provides both privacy to the user and accuracy in the mining results. However, mining the distorted database can be orders of magnitude more time-consuming as compared to mining the original database. In this paper, we address this issue and demonstrate that by (a) generalizing the distortion process to perform symbol-specific distortion, (b) appropriately choosing the distortion parameters, and (c) applying a variety of optimizations in the reconstruction process, runtime efficiencies that are well within an order of magnitude of undistorted mining can be achieved.

[bib]

''this is the bibtex entry ...''

• Chris Clifton and Murat Kantarcioglu and Jaideep Vaidya, "Defining Privacy for Data Mining," in Proceedings of the National Science Foundation Workshop on Next Generation Data Mining, November 1-3, 2002, Baltimore, MD. [link]

[abstract]

Privacy preserving data mining – getting valid data mining results without learning the underlying data values – has been receiving attention in the research community and beyond. It is unclear what privacy preserving means. This paper provides a framework and metrics for discussing the meaning of privacy preserving data mining, as a foundation for further research in this field.

[bib]

''this is the bibtex entry ...''




Additive Data Perturbation

• R. Agrawal and R. Srikant, “Privacy Preserving Data Mining,” Proc. ACM SIGMOD Conf. Management of Data, pp. 439-450, May 2000. [link]

[abstract]

A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers built with the original data.

[bib]

@INPROCEEDINGS{Agrawal:00,
  AUTHOR =       "R. Agrawal and R. Srikant",
  TITLE =        "Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the ACM SIGMOD Conference on Management of Data",
  YEAR =         "2000",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "439--450",
  address =      "Dallas, TX",
  month =        "May",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• D. Agrawal and C. C. Aggarwal, "On the Design and Quantification of Privacy Preserving Data Mining Algorithms," Proc. ACM SIGMOD, pp. 247-255, 2001. [link]

[abstract]

The increasing ability to track and collect large amounts of data with the use of current hardware technology has lead to an interest in the development of data mining algorithms which preserve user privacy. A recently proposed technique addresses the issue of privacy preservation by perturbing the data and reconstructing distributions at an aggregate level in order to perform the mining. This method is able to re- tain privacy while accessing the information implicit in the original attributes. The distribution reconstruction process naturally leads to some loss of information which is accept- able in many practical situations. This paper discusses an Expectation Maximization (EM) algorithm for distribution reconstruction which is more effective than the currently available method in terms of the level of information loss. Specifically, we prove that the EM algorithm converges to the maximum likelihood estimate of the original distribu- tion based on the perturbed data. We show that when a large amount of data is available, the EM algorithm pro- vides robust estimates of the original distribution. We pro- pose metrics for quantification and measurement of privacy- preserving data mining algorithms. Thus, this paper pro- vides the foundations for measurement of the effectiveness of privacy preserving data mining algorithms. Our privacy metrics illustrate some interesting results on the relative ef- fectiveness of different perturbing distributions.

[bib]

@INPROCEEDINGS{Agrawal:01,
  AUTHOR =       "D. Agrawal and C. C. Aggarwal" 
  TITLE =        "On the Design and Quantification of Privacy
                   Preserving Data Mining Algorithms",
  BOOKTITLE =    "Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of Database Systems",
  YEAR =         "2001",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "247--255",
  address =      "Santa Barbara, CA",
  month =        "",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, “On the Privacy Preserving Properties of Random Data Perturbation Techniques,” Proc. IEEE Int’l Conf. Data Mining, Nov. 2003. [link]

[abstract]

Privacy is becoming an increasingly important issue in many data mining applications. This has triggered the development of many privacy-preserving data mining techniques. A large fraction of them use randomized data distortion techniques to mask the data for preserving the privacy of sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values often using additive noise. This paper questions the utility of the random value distortion technique in privacy preservation. The paper notes that random objects (particularly random matrices) have "predictable" structures in the spectral domain and it develops a random matrix-based spectral filtering technique to retrieve original data from the dataset distorted by adding random values. The paper presents the theoretical foundation of this filtering method and extensive experimental results to demonstrate that in many cases random data distortion preserve very little data privacy. The paper also points out possible avenues for the development of new privacy-preserving data mining techniques like exploiting multiplicative and colored noise for preserving privacy in data mining applications.

[bib]

@INPROCEEDINGS{Kargupta:03a,
    AUTHOR      = {H. Kargupta and S. Datta and Q. Wang and K. Sivakumar},
    TITLE       = "On the Privacy Preserving Properties of Random Data Perturbation Techniques",
    BOOKTITLE   = {{Proceedings of the IEEE International Conference on Data Mining}},
    YEAR        = {2003},
    EDITOR      = {},
    PAGES       = {99},
    VOLUME      = {},
    NUMBER      = {},
    SERIES      = {},
    ADDRESS     = {Melbourne, FL},
    MONTH       = {November},
    NOTE        = {},
    KEY         = {},
    KEYWORDS    = {},
    ISBN        = {},
    URL         = {},
    ABSTRACT    = {},
}

• K. Muralidhar and R. Sarathy, "Security of Random Data Perturbation Methods," ACM Transactions on Database Systems, Vol. 24, No. 4, December 1999, Pages 487–493. [link]

[abstract]

Statistical databases often use random data perturbation (RDP) methods to protect against disclosure of confidential numerical attributes. One of the key requirements of RDP methods is that they provide the appropriate level of security against snoopers who attempt to obtain information on confidential attributes through statistical inference. In this study, we evaluate the security provided by three methods of perturbation. The results of this study allow the database administrator to select the most effective RDP method that assures adequate protection against disclosure of confidential information.

[bib]

''this is the bibtex entry ...''

• K. Muralidhar and R. Sarathy, "A theoretical basis for perturbation methods," Statistics and Computing 13: 329–335, 2003. [link]

[abstract]

In this paper we discuss a new theoretical basis for perturbation methods. In developing this new theoretical basis, we define the ideal measures of data utility and disclosure risk. Maximum data utility is achieved when the statistical characteristics of the perturbed data are the same as that of the original data. Disclosure risk is minimized if providing users with microdata access does not result in any additional information. We show that when the perturbed values of the confidential variables are generated as independent realizations from the distribution of the confidential variables conditioned on the non-confidential variables, they satisfy the data utility and disclosure risk requirements. We also discuss the relationship between the theoretical basis and some commonly used methods for generating perturbed values of confidential numerical variables.

[bib]

''this is the bibtex entry ...''

• Alexandre Evfimievski, "Randomization in Privacy Preserving Data Mining," ACM SIGKDD Explorations Newsletter, Volume 4, Issue 2, Pages: 43-48, December 2002. [link]

[abstract]

Suppose there are many clients, each having some personal information, and one server, which is interested only in aggregate, statistically significant, properties of this information. The clients can protect privacy of their data by perturbing it with a randomization algorithm and then submitting the randomized version. The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. How much distortion is needed to protect privacy can be determined using a privacy measure. Several possible privacy measures are known; finding the best measure is an open question. This paper presents some methods and results in randomization for numerical and categorical data, and discusses the issue of measuring privacy.

[bib]

''this is the bibtex entry ...''

• A. Evfimevski, J. Gehrke, and R. Srikant, “Limiting Privacy Breaches in Privacy Preserving Data Mining,” Proc. ACM SIGMOD/PODS Conf., June 2003. [link]

[abstract]

There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average.In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it is possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from [9] to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.

[bib]

@INPROCEEDINGS{Evfimievski:03,
  AUTHOR =       "A. Evfimevski and J. Gehrke and R. Srikant",
  TITLE =        "Limiting Privacy Breaches in Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the ACM SIGMOD/PODS Conference",
  YEAR =         "2003",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "211--222",
  address =      "San Diego, CA",
  month =        "June",
  organization = "",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• C. W. Wu, "Privacy Preserving Data Mining: a signal processing perspective and a simple data perturbation protocol," 2nd Workshop on Privacy Preserving Data Mining, IEEE International Conference on Data Mining, Melbourne, FL, 2003. [link]

[abstract]

Privacy concerns over the proliferation of gathering of personal information by various institutions over the internet led to the development of data mining algorithms that preserve the privacy of those whose personal data are collected and analyzed. A novel approach to such privacy preserving data mining algorithms was proposed where the individual datum in a data set is perturbed by adding a random value from a known distribution. In these applications, the distribution of the original data set is important and estimating it is one of the goals of the data mining algorithm. This distribution is estimated via an iterative algorithmsuch as the ExpectationMaximization (EM) algorithmwhich was shown to have desirable properties such as low privacy loss and high fidelity estimates of the distribution. Each iteration of EM requires computation that is proportional to the size of the data set and can require large computation time to estimate the distribution. In this paper we propose two ways to reduce the amount of computation. First, we show that the problem can be recast as a deconvolution problem and signal processing algorithms can be applied to solve this problem. In particular we consider both a direct method and iterative methods which are more robust against noise and ill-conditioning. We show that the Richardson-Lucy deblurring algorithm is equivalent to EM after quantization. The signal processing approach also shows how the choice of perturbation affects information loss and privacy loss and allows us to clarify some points made in the literature.

In the second part of this paper, we propose a scheme for perturbing data which also has the nice properties of arbitrarily small privacy loss and arbitrarily high fidelity in the estimate. The main advantage of the proposed scheme is the simplicity of the estimation algorithm. In contrast to iterative algorithms such as EM, the proposed scheme estimates the unknown distribution in one step. This is significant in applications where the data set is very large or when the data mining algorithm is run in an online environment.

[bib]

''this is the bibtex entry ...''

• Z. Huang and W. Du and B. Chen, "Deriving Private Information from Randomized Data," in Proceedings of the 2005 ACM SIGMOD Conference, pp. 37-48, Baltimroe, MD, 2005. [link]

[abstract]

Randomization has emerged as a useful technique for data disguising in privacy-preserving data mining. Its privacy properties have been studied in a number of papers. Kar- gupta et al. challenged the randomization schemes, and they pointed out that randomization might not be able to preserve privacy. However, it is still unclear what factors cause such a security breach, how they affect the privacy preserving property of the randomization, and what kinds of data have higher risk of disclosing their private contents even though they are randomized.

We believe that the key factor is the correlations among

attributes. We propose two data reconstruction methods that are based on data correlations. One method uses the Principal Component Analysis (PCA) technique, and the other method uses the Bayes Estimate (BE) technique. We have conducted theoretical and experimental analysis on the relationship between data correlations and the amount of private information that can be disclosed based our proposed data reconstructions schemes. Our studies have shown that when the correlations are high, the original data can be re- constructed more accurately, i.e., more private information can be disclosed.

To improve privacy, we propose a modified randomization

scheme, in which we let the correlation of random noises "similar" to the original data. Our results have shown that the reconstruction accuracy of both PCA-based and BE- based schemes become worse as the similarity increases.

[bib]

@INPROCEEDINGS{Huang_05,
  AUTHOR =       "Z. Huang and W. Du and B. Chen",
  TITLE =        "Deriving Private Information from Randomized Data",
  BOOKTITLE =    "Proceedings of the 2005 ACM SIGMOD Conference",
  YEAR =         "2005",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "37--48",
  address =      "Baltimroe, MD",
  month =        "June",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• S. Guo and X. Wu, "On the Use of Spectral Filtering for Privacy Preserving Data Mining," Proceedings of the 21st ACM Symposium on Applied Computing (SAC06), Dijon, France, April 23-27, pp. 622-626, 2006. [link]

[abstract]

Randomization has been a primary tool to hide sensitive pri- vate information during privacy preserving data mining.The previous work based on spectral ¯ltering, show the noise may be separated from the perturbed data under some conditions and as a result privacy can be seriously compromised. In this paper, we explicitly assess the e®ects of perturbation on the accuracy of the estimated value and give the explicit rela- tion on how the estimation error varies with perturbation. In particular, we derive one upper bound for the Frobenius norm of reconstruction error. This upper bound may be ex- ploited by attackers to determine how close their estimates are from the original data using spectral ¯ltering technique, which imposes a serious threat of privacy breaches.

[bib]

@INPROCEEDINGS{Guo_06,
  AUTHOR =       "S. Guo and X. Wu",
  TITLE =        "On the Use of Spectral Filtering for Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the 21st ACM Symposium on Applied Computing (SAC'06)",
  YEAR =         "2006",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "622--626",
  address =      "Dijon, France",
  month =        "April",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• S.Guo, X.Wu and Y. Li, "On the Lower Bound of Reconstruction Error for Spectral Filting," Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD06), Berlin, Germany, Sept 18-22, 2006. [link]

[abstract]

This is the abstract ...

[bib]

@INPROCEEDINGS{,
  AUTHOR =       "S.Guo and X.Wu and Y. Li",
  TITLE =        "On the Lower Bound of Reconstruction Error for Spectral Filting",
  BOOKTITLE =    "Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06)",
  YEAR =         "2006",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "Berlin, Germany",
  month =        "September",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}


Multiplicative Data Perturbation

• Kun Liu, Hillol Kargupta, and Jessica Ryan, Random Projection-based Multiplicative Perturbation for Privacy Preserving Distributed Data Mining. IEEE Transactions on Knowledge and Data Engineering (TKDE), VOL. 18, NO. 1, pages 92--106, Piscataway, NJ, January 2006. [link]

[abstract]

This paper explores the possibility of using multiplicative random projection matrices for privacy preserving distributed data mining. It specifically considers the problem of computing statistical aggregates like the inner product matrix, correlation coefficient matrix, and Euclidean distance matrix from distributed privacy sensitive data possibly owned by multiple parties. This class of problems is directly related to many other data-mining problems such as clustering, principal component analysis, and classification. This paper makes primary contributions on two different grounds. First, it explores Independent Component Analysis as a possible tool for breaching privacy in deterministic multiplicative perturbation-based models such as random orthogonal transformation and random rotation. Then, it proposes an approximate random projection-based technique to improve the level of privacy protection while still preserving certain statistical characteristics of the data. The paper presents extensive theoretical analysis and experimental results. Experiments demonstrate that the proposed technique is effective and can be successfully used for different types of privacypreserving data mining applications.

[bib]

@ARTICLE{Liu_06a,
    AUTHOR      = {Kun Liu and Hillol Kargupta and Jessica Ryan},
    TITLE       = {{Random Projection-Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining}},
    JOURNAL     = {{IEEE Transactions on Knowledge and Data Engineering (TKDE)}},
    YEAR        = {2006},
    VOLUME      = {18},
    NUMBER      = {1},
    PAGES       = {92--106},
    MONTH       = {January},
    NOTE        = {},
    KEY         = {},
    KEYWORDS    = {},
    ISBN        = {1041-4347},
    URL         = {http://doi.ieeecomputersociety.org/10.1109/TKDE.2006.14},
    ABSTRACT    = {},
}

• Kun Liu, Chris Giannella and Hillol Kargupta, "An Attacker's View of Distance Preserving Maps for Privacy Preserving Data Mining," in Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), Berlin, Germany, September, 2006. [link]

[abstract]

We examine the effectiveness of distance preserving transformations in privacy preserving data mining. These techniques are potentially very useful in that some important data mining algorithms can be applied to the transformed data and produce exactly the same results as if applied to the original data e.g. distance-based clustering, k-nearest neighbor classification. However, the issue of how well the original data is hidden has, to our knowledge, not been carefully studied. We take a step in this direction by assuming the role of an attacker armed with two types of prior information regarding the original data. We examine how well the attacker can recover the original data from the transformed data and prior information. Our results offer insight into the vulnerabilities of distance preserving transformations.

[bib]

 
@INPROCEEDINGS{Liu_06b,
  AUTHOR =       "Kun Liu and Chris Giannella and Hillol Kargupta",
  TITLE =        "An Attacker's View of Distance Preserving Maps for Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the 10th European Conference on Principles and
Practice of Knowledge Discovery in Databases (PKDD'06)",
  YEAR =         "2006",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "Berlin, Germany",
  month =        "September",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• Hillol Kargupta, Kun Liu, and Jessica Ryan, "Privacy Sensitive Distributed Data Mining from Multi-party Data," Proceedings of the first NSF/NIJ Symposium on Intelligence and Security Informatics, (Tucson, AZ), June 2003, pp. 336-342. [link]

[abstract]

Privacy is becoming an increasingly important issue in data mining, particularly in security and counter-terrorism-related applications where the data is often sensitive. This paper considers the problem of mining privacy sensitive distributed multi-party data. It specifically considers the problem of computing statistical aggregates like the correlation matrix from privacy sensitive data where the program for computing the aggregates is not trusted by the owner(s) of the data. It presents a brief overview of a random projection-based technique to compute the correlation matrix from a single third-party data site and also multiple homogeneous sites.

[bib]

@INPROCEEDINGS{Kargupta_03,
  AUTHOR =       "Hillol Kargupta and Kun Liu and Jessica Ryan",
  TITLE =        "Privacy Sensitive Distributed Data Mining from Multi-party Data",
  BOOKTITLE =    "Proceedings of the first NSF/NIJ Symposium on Intelligence and Security Informatics",
  YEAR =         "2003",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "Lecture Notes in Computer Science",
  pages =        "336--342",
  address =      "Tucson, AZ",
  month =        "June",
  organization = "",
  publisher =    "Springer Berlin/Heidelberg",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• J.J. Kim and W.E. Winkler, “Multiplicative Noise for Masking Continuous Data,” Technical Report Statistics #2003-01, Statistical Research Division, US Bureau of the Census, Washington D.C., Apr. 2003. [link]

[abstract]

To protect the identity of the persons or firms on a microdata file, noise is sometimes added to the data before releasing it to the public. There has been conjecture that, rather than adding noise, multiplying noise might better protect the confidentiality. Two forms of multiplicative noise are considered. The first approach is generating random numbers which have mean one and small variance, and multiplying the original data by the noise. The second approach is to take a logarithmic transformation, compute a covariance matrix of the transformed data, generate random number which follows mean zero and variance/covariance c times the variance/covariance computed in the previous step, add the noise to the transformed data and take an antilog of the noise added data. This paper investigates the statistical properties of both methods and shows how well they protect the identity of those on the file via reidentification trials.

[bib]

@TECHREPORT{Kim_03,
  AUTHOR =       "J. J. Kim and W. E. Winkler",
  TITLE =        "Multiplicative Noise for Masking Continuous Data",
  INSTITUTION =  "Statistical Research Division, U.S. Bureau of the Census",
  YEAR =         "2003",
  type =         "",
  number =       "Statistics \#2003-01",
  address =      "Washington D.C.",
  month =        "April",
  note =         "",
  abstract =     "",
  keywords =     "",
  source =       "",
  file = F
}

• K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), pages 589–592, Houston, TX, November 2005. [link]

[abstract]

This paper presents a random rotation perturbation approach for privacy preserving data classification. Concretely, we identify the importance of classificationspecific information with respect to the loss of information factor, and present a random rotation perturbation framework for privacy preserving data classification. Our approach has two unique characteristics. First, we identify that many classification models utilize the geometric properties of datasets, which can be preserved by geometric rotation. We prove that the three types of classifiers will deliver the same performance over the rotation perturbed dataset as over the original dataset. Second, we propose a multi-column privacy model to address the problems of evaluating privacy quality for multidimensional perturbation. With this metric, we develop a local optimal algorithm to find the good rotation perturbation in terms of privacy guarantee. We also analyze both naive estimation and ICA-based reconstruction attacks with the privacy model. Our initial experiments show that the random rotation approach can provide high privacy guarantee while maintaining zero-loss of accuracy for the discussed classifiers.

[bib]

@INPROCEEDINGS{Chen_05,
  AUTHOR =       "K. Chen and L. Liu",
  TITLE =        "Privacy Preserving Data Classification with Rotation Perturbation",
  BOOKTITLE =    "Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05)",
  YEAR =         "2005",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "589--592",
  address =      "Houston, TX",
  month =        "November",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• S. R. M. Oliveira and O. R. Zaiane. Privacy preservation when sharing data for clustering. In Proceedings of the International Workshop on Secure Data Management in a Connected World, pages 67–82, Toronto, Canada, August 2004. [link]

[abstract]

In this paper, we address the problem of protecting the underlying attribute values when sharing data for clustering. The challenge is how to meet privacy requirements and guarantee valid clustering results as well. To achieve this dual goal, we propose a novel spatial data transformation method called Rotation-Based Transformation (RBT). The major features of our data transformation are: a) it is independent of any clustering algorithm, b) it has a sound mathematical foundation; c) it is efficient and accurate; and d) it does not rely on intractability hypotheses from algebra and does not require CPU-intensive operations. We show analytically that although the data are transformed to achieve privacy, we can also get accurate clustering results by the safeguard of the global distances between data points.

[bib]

@INPROCEEDINGS{Oliveira_04p,
  AUTHOR =       "S. R. M. Oliveira and O. R. {Za\"{i}ane}",
  TITLE =        "Privacy Preservation When Sharing Data For Clustering",
  BOOKTITLE =    "Proceedings of the International Workshop on Secure Data Management in a Connected World",
  YEAR =         "2004",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "67--82",
  address =      "Toronto, Canada",
  month =        "August",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}




Categorical or General Data Perturbation

• Alexandre Evfimievski, "Randomization in Privacy Preserving Data Mining," ACM SIGKDD Explorations Newsletter, Volume 4, Issue 2, Pages: 43-48, December 2002. [link]

[abstract]

Suppose there are many clients, each having some personal information, and one server, which is interested only in aggregate, statistically significant, properties of this information. The clients can protect privacy of their data by perturbing it with a randomization algorithm and then submitting the randomized version. The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. How much distortion is needed to protect privacy can be determined using a privacy measure. Several possible privacy measures are known; finding the best measure is an open question. This paper presents some methods and results in randomization for numerical and categorical data, and discusses the issue of measuring privacy.

[bib]

''this is the bibtex entry ...''

• A. Evfimevski, J. Gehrke, and R. Srikant, “Limiting Privacy Breaches in Privacy Preserving Data Mining,” Proc. ACM SIGMOD/PODS Conf., June 2003. [link]

[abstract]

There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average.In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it is possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from [9] to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.

[bib]

@INPROCEEDINGS{Evfimievski:03,
  AUTHOR =       "A. Evfimevski and J. Gehrke and R. Srikant",
  TITLE =        "Limiting Privacy Breaches in Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the ACM SIGMOD/PODS Conference",
  YEAR =         "2003",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "211--222",
  address =      "San Diego, CA",
  month =        "June",
  organization = "",
  note =         "",
  abstract =     "",
  keywords =     "",
}


Data Anonymization

• Latanya Sweeney, "k-anonymity: a model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Volume 10, Issue 5, pp. 557-570, 2002. [link]

[abstract]

Consider a data holder, such as a hospital or a bank, that has a privately held collection of person-specific, field structured data. Suppose the data holder wants to share a version of the data with researchers. How can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful? The solution provided in this paper includes a formal protection model named k-anonymity and a set of accompanying policies for deployment. A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release. This paper also examines re-identification attacks that can be realized on releases that adhere to k- anonymity unless accompanying policies are respected. The k-anonymity protection model is important because it forms the basis on which the real-world systems known as Datafly, µ-Argus and k-Similar provide guarantees of privacy protection.

[bib]

@article{774552,
 author = {Latanya Sweeney},
 title = {k-anonymity: a model for protecting privacy},
 journal = {Int. J. Uncertain. Fuzziness Knowl.-Based Syst.},
 volume = {10},
 number = {5},
 year = {2002},
 issn = {0218-4885},
 pages = {557--570},
 doi = {http://dx.doi.org/10.1142/S0218488502001648},
 publisher = {World Scientific Publishing Co., Inc.},
 address = {River Edge, NJ, USA},
 }

• L. Sweeney, "Achieving k-anonymity privacy protection using generalization and suppression," International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), pp. 571-588, 2002. [link]

[abstract]

Often a data holder, such as a hospital or bank, needs to share person-specific records in such a way that the identities of the individuals who are the subjects of the data cannot be determined. One way to achieve this is to have the released records adhere to k- anonymity, which means each released record has at least (k-1) other records in the release whose values are indistinct over those fields that appear in external data. So, k- anonymity provides privacy protection by guaranteeing that each released record will relate to at least k individuals even if the records are directly linked to external information. This paper provides a formal presentation of combining generalization and suppression to achieve k-anonymity. Generalization involves replacing (or recoding) a value with a less specific but semantically consistent value. Suppression involves not releasing a value at all. The Preferred Minimal Generalization Algorithm (MinGen), which is a theoretical algorithm presented herein, combines these techniques to provide k-anonymity protection with minimal distortion. The real-world algorithms Datafly and µ-Argus are compared to MinGen. Both Datafly and µ-Argus use heuristics to make approximations, and so, they do not always yield optimal results. It is shown that Datafly can over distort data and µ-Argus can additionally fail to provide adequate protection.

[bib]

@article{774553,
 author = {Latanya Sweeney},
 title = {Achieving k-anonymity privacy protection using generalization and suppression},
 journal = {Int. J. Uncertain. Fuzziness Knowl.-Based Syst.},
 volume = {10},
 number = {5},
 year = {2002},
 issn = {0218-4885},
 pages = {571--588},
 doi = {http://dx.doi.org/10.1142/S021848850200165X},
 publisher = {World Scientific Publishing Co., Inc.},
 address = {River Edge, NJ, USA},
 }

• A. Meyerson and R. Williams, "General k-anonymization is Hard," Carnegie Mellon School of Computer Science Tech Report, 03-113, 2003. [link]

[abstract]

This is the abstract ...

[bib]

''this is the bibtex entry ...''

• K. Wang, P. S. Yu, and S. Chakraborty, "Bottom-up generalization: a data mining solution to privacy protection," in Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), November 2004. [link]

[abstract]

The well-known privacy-preserved data mining modifies existing data mining techniques to randomized data. In this paper, we investigate data mining as a technique for masking data, therefore, termed data mining based privacy protection. This approach incorporates partially the requirement of a targeted data mining task into the process of masking data so that essential structure is preserved in the masked data. The idea is simple but novel: we explore the data generalization concept from data mining as a way to hide detailed information, rather than discover trends and patterns. Once the data is masked, standard data mining techniques can be applied without modification. Our work demonstrated another positive use of data mining technology: not only can it discover useful patterns, but also mask private information.

We consider the following privacy problem: a data holder wants to release a version of data for building classification models, but wants to protect against linking the released data to an external source for inferring sensitive information. We adapt an iterative bottom-up generalization from data mining to generalize the data. The generalized data remains useful to classification but becomes difficult to link to other sources. The generalization space is specified by a hierarchical structure of generalizations. A key is identifying the best generalization to climb up the hierarchy at each iteration. Enumerating all candidate generalizations is impractical. We present a scalable solution that examines at most one generalization in each iteration for each attribute involved in the linking.

[bib]

@inproceedings{WYC04icdm,
  author = "K. Wang and P. S. Yu and S. Chakraborty",
  title = "Bottom-up generalization: a data mining solution to privacy protection",
  booktitle = "Proc. of the 4th IEEE International Conference on Data Mining (ICDM 2004)",
  month = "November",
  year = "2004",
}

• Roberto J. Bayardo and Rakesh Agrawal, "Data Privacy Through Optimal k-Anonymization," in Proceedings of the 21st International Conference on Data Engineering (ICDE'05) - Volume 00, pp. 217-228, 2005. [link]

[abstract]

Data de-identification reconciles the demand for release of data for research purposes and the demand for privacy from individuals. This paper proposes and evaluates an optimization algorithm for the powerful de-identification procedure known as k-anonymization. A k-anonymized dataset has the property that each record is indistinguishable from at least k - 1 others. Even simple restrictions of optimized k-anonymity are NP-hard, leading to significant computational challenges. We present a new approach to exploring the space of possible anonymizations that tames the combinatorics of the problem, and develop data-management strategies to reduce reliance on expensive operations such as sorting. Through experiments on real census data, we show the resulting algorithm can find optimalk-anonymizations under two representative cost measures and a wide range of k. We also show that the algorithm can produce good anonymizations in circumstances where the input data or input parameters preclude finding an optimal solution in reasonable time. Finally, we use the algorithm to explore the effects of different coding approaches and problem variations on anonymization quality and performance. To our knowledge, this is the first result demonstrating optimal k-anonymization of a nontrivial dataset under a general model of the problem.

[bib]

@inproceedings{1054048,
 author = {Roberto J. Bayardo and Rakesh Agrawal},
 title = {Data Privacy through Optimal k-Anonymization},
 booktitle = {ICDE '05: Proceedings of the 21st International Conference on Data Engineering (ICDE'05)},
 year = {2005},
 isbn = {0-7695-2285-8},
 pages = {217--228},
 doi = {http://dx.doi.org/10.1109/ICDE.2005.42},
 publisher = {IEEE Computer Society},
 address = {Washington, DC, USA},
 }

• Sheng Zhong and Zhiqiang Yang and Rebecca N. Wright, "Privacy-enhancing k-anonymization of customer data," in Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 139--147, 2005. [link]

[abstract]

In order to protect individuals' privacy, the technique of k-anonymization has been proposed to de-associate sensitive attributes from the corresponding identifiers. In this paper, we provide privacy-enhancing methods for creating k-anonymous tables in a distributed scenario. Specifically, we consider a setting in which there is a set of customers, each of whom has a row of a table, and a miner, who wants to mine the entire table. Our objective is to design protocols that allow the miner to obtain a k-anonymous table representing the customer data, in such a way that does not reveal any extra information that can be used to link sensitive attributes to corresponding identifiers, and without requiring a central authority who has access to all the original data. We give two different formulations of this problem, with provably private solutions. Our solutions enhance the privacy of k-anonymization in the distributed scenario by maintaining end-to-end privacy from the original customer data to the final k-anonymous results.

[bib]

@inproceedings{1065185,
 author = {Sheng Zhong and Zhiqiang Yang and Rebecca N. Wright},
 title = {Privacy-enhancing k-anonymization of customer data},
 booktitle = {PODS '05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems},
 year = {2005},
 isbn = {1-59593-062-0},
 pages = {139--147},
 location = {Baltimore, Maryland},
 doi = {http://doi.acm.org/10.1145/1065167.1065185},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }

• B. C. M. Fung, K. Wang, and P. S. Yu, "Top-down specialization for information and privacy preservation," in Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005), Tokyo, Japan, April 2005, pp. 205-216. [link]

[abstract]

Releasing person-specific data in its most specific state poses a threat to individual privacy. This paper presents a practical and efficient algorithm for determining a generalized version of data that masks sensitive information and remains useful for modelling classification. The generalization of data is implemented by specializing or detailing the level of information in a top-down manner until a minimum privacy requirement is violated. This top-down specialization is natural and efficient for handling both categorical and continuous attributes. Our approach exploits the fact that data usually contains redundant structures for classification. While generalization may eliminate some structures, other structures emerge to help. Our results show that quality of classification can be preserved even for highly restrictive privacy requirements. This work has great applicability to both public and private sectors that share information for mutual benefits and productivity.

[bib]

@inproceedings{FWY05icde,
  author = "B. C. M. Fung and K. Wang and P. S. Yu",
  title = "Top-Down Specialization for Information and Privacy Preservation",
  booktitle = "Proc. of the 21st IEEE International Conference on Data Engineering (ICDE 2005)",
  pages = "205-216",
  address = "Tokyo, Japan",
  month = "April",
  year = "2005",
}

• K. Wang, B. C. M. Fung, and G. Dong, "Integrating private databases for data analysis," in Proceedings of the 2005 IEEE International Conference on Intelligence and Security Informatics (ISI 2005), Atlanta, GA, May 2005, pp. 171-182. [link]

[abstract]

In today’s globally networked society, there is a dual demand on both information sharing and information protection. A typical scenario is that two parties wish to integrate their private databases to achieve a common goal beneficial to both, provided that their privacy requirements are satisfied. In this paper, we consider the goal of building a classifier over the integrated data while satisfying the k-anonymity privacy requirement. The k-anonymity requirement states that domain values are generalized so that each value of some specified attributes identifies at least k records. The generalization process must not leak more specific information other than the final integrated data. We present a practical and efficient solution to this problem.

[bib]

@inproceedings{WFD05isi,
  author = "K. Wang and B. C. M. Fung and G. Dong",
  title = "Integrating Private Databases for Data Analysis",
  booktitle = "Proc. of the 2005 IEEE International Conference on Intelligence and Security Informatics (ISI 2005)",
  series = "Lecture Notes in Computer Science",
  volume = "3495",
  pages = "171-182",
  address = "Atlanta, GA",
  month = "May",
  year = "2005",
}

• K. Wang, B. C. M. Fung, and P. S. Yu, "Template-based privacy preservation in classification problems," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), Houston, TX, November 2005, pp. 466-473. [link]

[abstract]

In this paper, we present a template-based privacy preservation to protect against the threats caused by data mining abilities. The problem has dual goals: preserve the information for a wanted classification analysis and limit the usefulness of unwanted sensitive inferences that may be derived from the data. Sensitive inferences are specified by a set of "privacy templates". Each template speci es the sensitive information to be protected, a set of identifying attributes, and the maximum association between the two. We show that suppressing the domain values is an effective way to eliminate sensitive inferences. For a large data set, finding an optimal suppression is hard, since it requires optimization over all suppressions. We present an approximate but scalable solution. We demonstrate the effectiveness of this approach on real life data sets.

[bib]

@inproceedings{WFY05icdm,
  author = "K. Wang and B. C. M. Fung and P. S. Yu",
  title = "Template-Based Privacy Preservation in Classification Problems",
  booktitle = "Proc. of the 5th IEEE International Conference on Data Mining (ICDM 2005)",
  address = "Houston, TX",
  pages = "466-473",
  month = "November",
  year = "2005",
}

• M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi, "k-Anonymous Patterns," In Proceedings of the Ninth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'05) Lecture Notes in Computer Science, Volume 3721, Springer. October 3-7, 2005, Porto, Portugal. [link]

[abstract]

It is generally believed that data mining results do not violate the anonymity of the individuals recorded in the source database. In fact, data mining models and patterns, in order to ensure a required statistical significance, represent a large number of individuals and thus conceal individual identities: this is the case of the minimum support threshold in association rule mining. In this paper we show that this belief is ill-founded. By shifting the concept of k-anonymity from data to patterns, we formally characterize the notion of a threat to anonymity in the context of pattern discovery, and provide a methodology to ef ciently and effectively identify all possible such threats that might arise from the disclosure of a set of extracted patterns.

[bib]

@INPROCEEDINGS{Atzori_05a,
  AUTHOR =       "M. Atzori, F. Bonchi, F. Giannotti and D. Pedreschi",
  TITLE =        "k-Anonymous Patterns",
  BOOKTITLE =    "Proceedings of the Ninth European Conference on Principles
                  and Practice of Knowledge Discovery in Databases (PKDD'05)",
  YEAR =         "2005",
  editor =       "",
  volume =       "3721",
  number =       "",
  series =       "Lecture Notes in Computer Science, Springer",
  pages =        "",
  address =      "Porto, Portugal",
  month =        "October",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi, "Blocking Anonymity Threats Raised by Frequent Itemset Mining," In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05), IEEE. November 27-30, 2005, New Orleans, Louisiana. [link]

[abstract]

In this paper we study when the disclosure of data min- ing results represents, per se, a threat to the anonymity of the individuals recorded in the analyzed database. The novelty of our approach is that we focus on an objective definition of privacy compliance of patterns without any reference to a preconceived knowledge of what is sensi- tive and what is not, on the basis of the rather intuitive and realistic constraint that the anonymity of individu- als should be guaranteed. In particular, the problem ad- dressed here arises from the possibility of inferring from the output of frequent itemset mining (i.e., a set of item- sets with support larger than a threshold \sigma), the exis- tence of patterns with very low support (smaller than an anonymity threshold k). In the following we develop a simple methodology to block such inference opportuni- ties by introducing distortion on the dangerous patterns.

[bib]

@INPROCEEDINGS{Atzori_05b,
  AUTHOR =       "M. Atzori, F. Bonchi, F. Giannotti and D. Pedreschi",
  TITLE =        "Blocking Anonymity Threats Raised by Frequent Itemset Mining",
  BOOKTITLE =    "Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05)",
  YEAR =         "2005",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "",
  month =        "November",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, Muthuramakrishnan Venkitasubramaniam, "l -Diversity: Privacy Beyond k-Anonymity," p. 24, 22nd International Conference on Data Engineering (ICDE'06), 2006. [link]

[abstract]

Publishing data about individuals without revealing sensitive information about them is an important problem. In recent years, a new definition of privacy called k-anonymity has gained popularity. In a k-anonymized dataset, each record is indistinguishable from at least k−1 other records with respect to certain “identifying” attributes. In this paper we show with two simple attacks that a k-anonymized dataset has some subtle, but severe privacy problems. First, we show that an attacker can discover the values of sensitive attributes when there is little diversity in those sensitive attributes. Second, attackers often have background knowledge, and we show that k-anonymity does not guarantee privacy against attackers using background knowledge. We give a detailed analysis of these two attacks and we propose a novel and powerful privacy defi- nition called l-diversity. In addition to building a formal foundation for l-diversity, we show in an experimental evaluation that ℓ-diversity is practical and can be implemented efficiently.

[bib]

@inproceedings{Machanavajjhala_06,
  author    = {Ashwin Machanavajjhala and
               Johannes Gehrke and
               Daniel Kifer and
               Muthuramakrishnan Venkitasubramaniam},
  title     = {l-Diversity: Privacy Beyond k-Anonymity.},
  booktitle = {Proceedings of the 22nd International Conference on Data
               Engineering (ICDE'06)},
  year      = {2006},
  pages     = {24},
  ee        = {http://doi.ieeecomputersociety.org/10.1109/ICDE.2006.1},
  crossref  = {DBLP:conf/icde/2006},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

• K. Wang, B. C. M. Fung, and P. S. Yu, "Handicapping attacker's confidence: an alternative to k-anonymization," in Knowledge and Information Systems: An International Journal (KAIS), 2006. [link]

[abstract]

We present an approach of limiting the con dence of inferring sensitive properties to protect against the threats caused by data mining abilities. The problem has dual goals: preserve the information for a wanted data analysis request and limit the usefulness of unwanted sensitive inferences that may be derived from the release of data. Sensitive inferences are speci ed by a set of "privacy templates". Each template speci es the sensitive property to be protected, the attributes identifying a group of individuals, and a maximum threshold for the con dence of inferring the sensitive property given the identifying attributes. We show that suppressing the domain values monotonically decreases the maximum con dence of such sensitive inferences. Hence, we propose a data transformation that minimally suppresses the domain values in the data to satisfy the set of privacy templates. The transformed data is free of sensitive inferences even in the presence of data mining algorithms. The prior k-anonymization focuses on personal identities. This work focuses on the association between personal identities and sensitive properties.

[bib]

@article{WFY06kais,
  author = "K. Wang and B. C. M. Fung and P. S. Yu",
  title = "Handicapping Attacker's Confidence: An Alternative to k-Anonymization",
  journal = "Knowledge and Information Systems: An International Journal (KAIS)",
  publisher = "Springer-Verlag",
  year = "2006",
}


Data Swapping

• S.E. Fienberg and J. McIntyre, “Data Swapping: Variations on a Theme by Dalenius and Reiss,” technical report, Nat’l Inst. of Statistical Sciences, Research Triangle Park, NC, 2003.[link]

[abstract]

Data swapping, a term introduced in 1978 by Dalenius and Reiss for a new method of statistical disclosure protection in confidential data bases, has taken on new meanings and been linked to new statistical methodologies over the intervening twenty-five years. This paper revisits the original (1982) published version of the the Dalenius-Reiss data swapping paper and then traces the developments of statistical disclosure limitation methods that can be thought of as rooted in the original concept. The emphasis here, as in the original contribution, is on both disclosure protection and the release of statistically usable data bases.

[bib]

@TECHREPORT{Fienberg_03,
  AUTHOR =       "S. E. Fienberg and J. McIntyre",
  TITLE =        "Data Swapping: Variations on a Theme by Dalenius and Reiss",
  INSTITUTION =  "National Institute of Statistical Sciences",
  YEAR =         "2003",
  type =         "",
  number =       "",
  address =      "Research Triangle Park, NC",
  month =        "",
  note =         "",
  abstract =     "",
  keywords =     "",
  source =       "",
  file = F
}

• Shanti Gomatam, Alan F. Karr and Ashish P. Sanil, "Data Swapping as a Decision Problem," 2004. [link]

[abstract]

We construct a decision-theoretic formulation of data swapping in which quantitative measures of disclosure risk and data utility are employed to select one release from a possibly large set of candidates. The decision variables are the swap rate, swap attribute(s) and possibly, constraints on the unswapped attributes. Risk–utility frontiers, consisting of those candidates not dominated in (risk, utility) space by any other candidate, are a principal tool for reducing the scale of the decision problem. Multiple measures of disclosure risk and data utility, including utility measures based directly on use of the swapped data for statistical inference, are introduced. Their behavior and resulting insights into the decision problem are illustrated using data from the Current Population Survey, the well-studied “Czech auto worker data” and data on schools and administrators generated by the National Center for Education Statistics.

[bib]

''this is the bibtex entry ...''



Randomized Response

• Andris Ambainis and Markus Jakobsson and Helger Lipmaa, "Cryptographic Randomized Response Techniques," Cryptology ePrint Archive: Report 2003/027. [link]

[abstract]

We develop cryptographically secure techniques to guarantee unconditional privacy for respondents to polls. Our constructions are efficient and practical, and are shown not to allow cheating respondents to affect the "tally" by more than their own vote - which will be given the exact same weight as that of other respondents. We demonstrate solutions to this problem based on both traditional cryptographic techniques and quantum cryptography.

[bib]

@misc{cryptoeprint:2003:027,
    author = {Andris Ambainis and Markus Jakobsson and Helger Lipmaa},
    title = {Cryptographic Randomized Response Techniques},
    howpublished = {Cryptology ePrint Archive, Report 2003/027},
    year = {2003},
    note = {\url{http://eprint.iacr.org/}},
}

• Wenliang Du and Zhijun Zhan, "Using Randomized Response Techniques for Privacy-Preserving Data Mining," SIDKDD 2003. [link]

[abstract]

Privacy is an important issue in data mining and knowledge discovery. In this paper, we propose to use the randomized response techniques to conduct the data mining computation. Specially, we present a method to build decision tree classifiers from the disguised data. We conduct experiments to compare the accuracy ofou r decision tree with the one built from the original undisguised data. Our results show that although the data are disguised, our method can still achieve fairly high accuracy. We also show how the parameter used in the randomized response techniques affects the accuracy ofth e results.

[bib]

''this is the bibtex entry ...''



Cryptographic/Secure Multi-Party Computation (SMC)

• B. Pinkas, “Cryptographic Techniques for Privacy Preserving Data Mining,” SIGKDD Explorations, vol. 4, no. 2, pp. 12-19, 2002. [link]

[abstract]

Research in secure distributed computation, which was done as part of a larger body of research in the theory of cryptography, has achieved remarkable results. It was shown that non-trusting parties can jointly compute functions of their different inputs while ensuring that no party learns anything but the defined output of the function. These results were shown using generic constructions that can be applied to any function that has an efficient representation as a circuit. We describe these results, discuss their efficiency, and demonstrate their relevance to privacy preserving computation of data mining algorithms. We also show examples of secure computation of data mining algorithms that use these generic constructions.

[bib]

@ARTICLE{Pinkas_02,
  AUTHOR =       "B. Pinkas",
  TITLE =        "Cryptographic Techniques for Privacy Preserving Data Mining",
  JOURNAL =      "SIGKDD Explorations",
  YEAR =         "2002",
  volume =       "4",
  number =       "2",
  pages =        "12--19 ",
  month =        "",
  note =         "",
  abstract =     "",
  keywords =     "",
  source =       "",
}

• M.J. Atallah and W. Du, “Secure Multi-Party Computational Geometry,” Proc. WADS2001: Seventh Int’l Workshop on Algorithms and Data Structures, pp. 165-179, Aug. 2001. [link]

[abstract]

The general secure multi-party computation problem is when multiple parties (say, Alice and Bob) each have private data (respectively, a and b) and seek to compute some function f(a; b) without revealing to each other anything unintended (i.e., anything other than what can be inferred from knowing f(a; b)). It is well known that, in theory, the general secure multi-party computation problem is solvable using circuit evaluation protocols. While this approach is appealing in its general- ity, the communication complexity of the resulting protocols depend on the size of the circuit that expresses the functionality to be computed. As Goldreich has recently pointed out [6], using the solutions derived from these general results to solve speci c problems can be impractical; problem-speci c solutions should be developed, for eÆciency reasons. This paper is a rst step in this direction for the area of computational geometry. We give simple solutions to some speci c geometric problems, and in doing so we develop some building blocks that we believe will be useful in the solution of other geometric and combinatorial problems as well.

[bib]

@INPROCEEDINGS{Atallah_01,
  AUTHOR =       "M. J. Atallah and W. Du",
  TITLE =        "Secure Multi-Party Computational Geometry",
  BOOKTITLE =    "WADS2001: Seventh International Workshop on Algorithms and Data Structures",
  YEAR =         "2001",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "165--179",
  address =      "Providence, Rhode Island",
  month =        "August",
  organization = "",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• W. Du and M. J. Atallah, "Secure MultiParty Computation Problems and Their Applications: A Review and Open Problems," in Proceedings of the 2001 Workshop on New Security Paradigms, pp. 13-22, 2001. [link]

[abstract]

The growth of the Internet has triggered tremendous opportunities for cooperative computation, where people are jointly conducting computation tasks based on the private inputs they each supplies. These computations could occur between mutually untrusted parties, or even between competitors. For example, customers might send to a remote database queries that contain private information; two competing nancial organizations might jointly invest in a project that must satisfy both organizations' private and valuable constraints, and so on. Today, to conduct such computations, one entity must usually know the inputs from all the participants; however if nobody can be trusted enough to know all the inputs, privacy will become a primary concern.

This problem is referred to as Secure Multi-party Computation Problem (SMC) in the literature. Research in the SMC area has been focusing on only a limited set of specific SMC problems, while privacy concerned cooperative computations call for SMC studies in a variety of computation domains. Before we can study the problems, we need to identify and define the specific SMC problems for those computation domains. We have developed a framework to facilitate this problem-discovery task. Based on our framework, we have identified and de ned a number of new SMC problems for a spectrum of computation domains. Those problems include privacy-preserving database query, privacy-preserving scientific computations, privacy-preserving intrusion detection, privacy-preserving statistical analysis, privacy-preserving geometric computations, and privacy-preserving data mining.

The goal of this paper is not only to present our results, but also to serve as a guideline so other people can identify useful SMC problems in their own computation domains.

[bib]

@inproceedings{Du:01a,
    author      = {W. Du and M. J. Atallah},
    title       = {Secure Multi-Party Computation Problems and Their
                   Applications: A Review and Open Problems},
    booktitle   = {Proceedings of the 2001 Workshop on New Security Paradigms},
    pages       = {13-22},
    month       = {September},
    address     = {Cloudcroft, NM},
    year        = 2001,
    publisher   = "ACM Press"
}

• O. Goldreich, The Foundations of Cryptography, vol. 2, chapter 7. Cambridge Univ. Press, 2004. [link]

[abstract]

The current volume is the second part (or volume) of the two-volume work Foundations of Cryptography. The first part (i.e., Volume 1) consists of an introductory chapter (Chapter 1), followed by chapters on computational difficulty (one-way functions), pseudorandomness and zero-knowledge proofs (Chapters 2-4, respectively). The current volume consists of three chapters, which treat encryption schemes (Chapter 5), signature schemes (Chapter 6), and general cryptographic protocols (Chapter 7), respectively. Also included is an appendix that provides corrections and additions to Volume 1.

[bib]

@INBOOK{Goldreich_04,
  AUTHOR =       "O. Goldreich",
  editor =       "",
  TITLE =        "The Foundations of Cryptography",
  CHAPTER =      "7",
  pages =        "",
  PUBLISHER =    "Cambridge University Press",
  YEAR =         "2004",
  volume =       "2",
  number =       "",
  series =       "",
  type =         "",
  address =      "",
  edition =      "",
  month =        "",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• Joan Feigenbaum and Yuval Ishai and Tal Malkin and Kobbi Nissim and Martin J. Strauss and Rebecca N. Wright, "Secure Multiparty Computation of Approximations," [link]

[abstract]

Approximation algorithms can sometimes provide efficient solutions when no efficient exact computation is known. In particular, approximations are often useful in a distributed setting where the inputs are held by di erent parties and are extremely large. Furthermore, for some applications, the parties want to cooperate to compute a function of their inputs without revealing more information than necessary. If \hat{f} is an approximation to f, secure multiparty computation of \hat{f} allows the parties to compute \hat{f} without revealing unnecessary information. However, secure computation of \hat{f} may not be as private as secure computation of f, because the output of \hat{f} may itself reveal more information than the output of f. In this paper, we present definitions of secure multiparty approximate computations that retain the privacy of a secure computation of f. We present an efficient, sublinear-communication, private approximate computation for the Hamming distance and an efficient private approximation of the permanent.

[bib]

''this is the bibtex entry ...''

• Wenliang Du, "A Study Of Several Specific Secure Two-Party Computation Problems," Ph.D. thesis, 2001. [link]

[abstract]

Alice has a private input $x$ (of any data type, such as a number, a matrix or a data set). Bob has another private input $y$. Alice and Bob want to cooperatively conduct a specific computation on $x$ and $y$ without disclosing to the other person any information about her or his private input except for what could be derived from the results. This problem is a Secure Two-party Computation (STC) problem, which has been extensively studied in the past. Several generic solutions have been proposed to solve the general STC problem; however the generic solutions are often too inefficient to be practical. Therefore, in this dissertation, we study several specific STC problems with the goal of finding more efficient solutions than the generic ones. We introduce a number of specific STC problems in the domains of scientific computation, statistical analysis, computational geometry and database query. Most of the problems have not been studied before in the literature. To solve these problems: (1) We investigate how data perturbation could be used to hide data. Data perturbation hides a datum by adding to it a random number. We show that this technique is effective in preserving privacy. (2) We explore how domain specific knowledge can improve the efficiency of the solutions that we develop over the generic solutions that do not consider domain specific knowledge. We show that such knowledge is important in both hiding data and achieving higher efficiency. (3) We also introduce a number of common building blocks that are useful in solving secure two-party computation problems in various computation domains.

[bib]

@PhdThesis{DuThesis2001,
  author =      {W. Du},
  title =       {A Study of Several Specific Secure Two-party Computation Problems},
  school =      {Purdue University},
  address =     {West Lafayette, Indiana},
  year =        2001
}

• Wenliang Du, Zhijun Zhang, "A Practical Approach to Solve Secure Multi-party Computation," in NSPW '02: Proceedings of the 2002 workshop on New security paradigms, pp. 127-135, 2002. [link]

[abstract]

Secure Multi-party Computation (SMC) problems deal with the following situation: Two (or many) parties want to jointly perform a computation. Each party needs to con- tribute its private input to this computation, but no party should disclose its private inputs to the other parties, or to any third party. With the proliferation of the Internet, SMC problems becomes more and more important. So far no practical solution has emerged, largely because SMC studies have been focusing on zero information disclosure, an ideal security model that is expensive to achieve.

Aiming at developing practical solutions to SMC problems,

we propose a new paradigm, in which we use an acceptable security model that allows partial information disclosure. Our conjecture is that by lowering the restriction on the security, we can achieve a much better performance. The paradigm is motivated by the observation that in practice people do accept a less secure but much more e cient solu- tion because sometimes disclosing information about their private data to certain degree is a risk that many people would rather take if the performance gain is so significant. Moreover, in our paradigm, the security is adjustable, such that users can adjust the level of security based on their de nition of the acceptable security. We have developed a number of techniques under this new paradigm, and are currently conducting extensive studies based on this new paradigm.

[bib]

@inproceedings{844125,
 author = {Wenliang Du and Zhijun Zhan},
 title = {A practical approach to solve Secure Multi-party Computation problems},
 booktitle = {NSPW '02: Proceedings of the 2002 workshop on New security paradigms},
 year = {2002},
 isbn = {1-58113-598-X},
 pages = {127--135},
 location = {Virginia Beach, Virginia},
 doi = {http://doi.acm.org/10.1145/844102.844125},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }

• Andris Ambainis and Markus Jakobsson and Helger Lipmaa, "Cryptographic Randomized Response Techniques," Cryptology ePrint Archive: Report 2003/027. [link]

[abstract]

We develop cryptographically secure techniques to guarantee unconditional privacy for respondents to polls. Our constructions are efficient and practical, and are shown not to allow cheating respondents to affect the "tally" by more than their own vote - which will be given the exact same weight as that of other respondents. We demonstrate solutions to this problem based on both traditional cryptographic techniques and quantum cryptography.

[bib]

@misc{cryptoeprint:2003:027,
    author = {Andris Ambainis and Markus Jakobsson and Helger Lipmaa},
    title = {Cryptographic Randomized Response Techniques},
    howpublished = {Cryptology ePrint Archive, Report 2003/027},
    year = {2003},
    note = {\url{http://eprint.iacr.org/}},
}

• Bart Goethals, Sven Laur, Helger Lipmaa and Taneli Mielikäinen. On Private Scalar Product Computation for Privacy-Preserving Data Mining. The 7th Annual International Conference in Information Security and Cryptology (ICISC 2004), volume 3506 of Lecture Notes in Computer Science, pages 104--120, 2004. [link]

[abstract]

In mining and integrating data from multiple sources, there are many privacy and security issues. In several different contexts, the security of the full privacy-preserving data mining protocol depends on the security of the underlying private scalar product protocol. We show that two of the private scalar product protocols, one of which was proposed in a leading data mining conference, are insecure.We then describe a provably private scalar product protocol that is based on homomorphic encryption and improve its efficiency so that it can also be used on massive datasets.

[bib]

''this is the bibtex entry ...''




Privacy Preserving Classification

• R. Agrawal and R. Srikant, “Privacy Preserving Data Mining,” Proc. ACM SIGMOD Conf. Management of Data, pp. 439-450, May 2000. [link]

[abstract]

A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.

[bib]

@INPROCEEDINGS{Agrawal:00,
  AUTHOR =       "R. Agrawal and R. Srikant",
  TITLE =        "Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the ACM SIGMOD Conference on Management of Data",
  YEAR =         "2000",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "439--450",
  address =      "Dallas, TX",
  month =        "May",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• Y. Lindell and B. Pinkas, "Privacy Preserving Data Mining," Advances in Cryptology (CRYPTO'00), pp. 36--53, 2000. [link]

[abstract]

In this paper we address the issue of privacy preserving data mining. Specifically, we consider a scenario in which two parties owning confidential databases wish to run a data mining algorithm on the union of their databases, without revealing any unnecessary information. Our work is motivated by the need to both protect privileged information and enable its use for research or other purposes. The above problem is a specific example of secure multi-party computation and as such, can be solved using known generic protocols. However, data mining algorithms are typically complex and, furthermore, the input usually consists of massive data sets. The generic protocols in such a case are of no practical use and therefore more efficient protocols are required. We focus on the problem of decision tree learning with the popular ID3 algorithm. Our protocol is considerably more efficient than generic solutions and demands both very few rounds of communication and reasonable bandwidth.

[bib]

@INPROCEEDINGS{Lindell:00,
  AUTHOR =       "Y. Lindell and B. Pinkas",
  TITLE =        "Privacy Preserving Data Mining",
  BOOKTITLE =    "Advances in Cryptology (CRYPTO'00)",
  YEAR =         "2000",
  editor =       "",
  volume =       "1880",
  number =       "",
  series =       "Lecture Notes in Computer Science",
  pages =        "36--53",
  address =      "",
  month =        "",
  organization = "",
  publisher =    "Springer-Verlag",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• W. Du and Z. Zhan, “Building Decision Tree Classifier on Private Data,” Proc. IEEE Int’l Conf. Privacy, Security, and Data Mining, pp. 1-8, Dec. 2002. [link]

[abstract]

This paper studies how to build a decision tree classifier under the following scenario:a database is vertically partitioned into two pieces, with one piece owned by Alice and the other piece owned by Bob. Alice and Bob want to build a decision tree classi- fier based on such a database, but due to the privacy constraints, neither of them wants to disclose their private pieces to the other party or to any third party. We present a protocol that allows Alice and Bob to conduct such a classifier building without having to compromise their privacy. Our protocol uses an untrusted third-party server, and is built upon a useful building block, the scalar product protocol. Our solution to the scalar product protocol is more efficient than any existing solutions.

[bib]

@inproceedings{Du:02,
    author      = {W. Du and Z. Zhan},
    title       = {Building Decision Tree Classifier on Private Data},
    booktitle   = {Proceedings of the IEEE International Conference on Privacy, Security and Data Mining},
    month       = {December},
    address     = {Maebashi City, Japan},
    year        = 2002,
    pages       = "1--8",
    publisher   = "Australian Computer Society, Inc.",
}

• Wenliang Du and Zhiju