Privacy Preserving Data Mining Bibliography


What's New:
- Welcome to the privacy preserving data mining (PPDM) bibliography. The goal of this page is to maintain and distribute a bibliography of PPDM-related publications. I hope that PPDM researchers and practitioners find this service useful, and I welcome any help from the community in maintaining the bibliography.
- Any questions or comments should be directed to Kun Liu <kunliu1 AT cs.umbc.edu>.


Contents


Privacy Preserving Data Mining Philosophical Issues

• Mary J. Cronin, "e-Privacy?," HOOVER DIGEST, No. 3, 2000. Adapted from the essay "Privacy and Electronic Commerce," by Mary J. Cronin, in the new Hoover Press book Public Policy and the Internet: Privacy, Taxes, and Contract, edited by Nicholas Imparato. [link]

[abstract]

Imagine going to a shopping mall in which researchers follow you from store to store, taking notes on every product you examine or buy. Would you shop in such a place? Chances are, you already do. Welcome to the Internet.

[bib]

''this is the bibtex entry ...''

• Alan J. Broder, "Data Mining, the Internet, and Privacy," International WEBKDD’99 Workshop San Diego, CA, USA, August 15, 1999. [link]

[abstract]

This paper addresses the inherent technological conflict between the desire for privacy by World Wide Web users, and the need of Web content providers and advertisers to more fully collect and utilize data about users. As the other papers in this volume illustrate, the Data Mining community has now turned its collective attention towards the Web as a fertile venue for research and development. In doing so, Data Miners have found themselves at the nexus of this conflict.

We present the technical issues regarding privacy from two perspectives. First, from the perspective of the Web user who may be unaware of the degree to which identifying information can be inadvertently disclosed. And second, from the perspective of a Data Miner we consider the extent to which privacy enhancing technologies could substantially invalidate data mining results.

[bib]

@INCOLLECTION{Broder_99,
  AUTHOR =       "Alan J. Broder",
  TITLE =        "Data Mining, the Internet, and Privacy",
  BOOKTITLE =    "Web Usage Analysis and User Profiling: International WEBKDD'99 Workshop San Diego, CA, USA, August 15, 1999 ",
  PUBLISHER =    "Springer Berlin/Heidelberg",
  YEAR =         "1999",
  editor =       "",
  volume =       "1836/2000",
  number =       "",
  series =       "",
  type =         "",
  chapter =      "",
  pages =        "56",
  address =      "",
  edition =      "",
  month =        "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• EPIC: Electronic Privacy Information Center, a public interest research center in Washington, D.C. [link]

[abstract]

[bib]


Privacy Preserving Data Mining Survey/General Issues

• V.S. Verykios, E. Bertino, I.N. Fovino, L.P. Provenza, Y. Saygin, and Y. Theodoridis, “State-of-the-Art in Privacy Preserving Data Mining,” ACM SIGMOD Record, vol. 33, no. 1, pp. 50-57, Mar. 2004. [link]

[abstract]

We provide here an overview of the new and rapidly emerging research area of privacy preserving data mining. We also propose a classification hierarchy that sets the basis for analyzing the work which has been performed in this context. A detailed review of the work accomplished in this area is also given, along with the coordinates of each work to the classification hierarchy. A brief evaluation is performed, and some initial conclusions are made.

[bib]

@ARTICLE{Verykios_04v,
  AUTHOR =       "V. S. Verykios and E. Bertino and I. N. Fovino and L. P. Provenza and Y. Saygin and Y. Theodoridis",
  TITLE =        "State-of-the-art in Privacy Preserving Data Mining",
  JOURNAL =      "ACM SIGMOD Record",
  YEAR =         "2004",
  volume =       "33",
  number =       "1",
  pages =        "50--57",
  month =        "March",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu, “Tools for Privacy Preserving Distributed Data Mining,” ACM SIGKDD Explorations, vol. 4, no. 2, 2003. [link]

[abstract]

Privacy preserving mining of distributed data has numerous applications. Each application poses different constraints: What is meant by privacy, what are the desired results, how is the data distributed, what are the constraints on collaboration and cooperative computing, etc. We suggest that the solution to this is a toolkit of components that can be combined for specific privacy-preserving data mining applications. This paper presents some components of such a toolkit, and shows how they can be used to solve several privacy-preserving data mining problems.

[bib]

@article{Clifton:03,
    author      = {C. Clifton and M. Kantarcioglu and J. Vaidya and X. Lin and M. Zhu},
    title       = "Tools for Privacy Preserving Distributed Data Mining",
    journal     = {ACM SIGKDD Explorations},
    volume      = {4},
    number      = {2},
    year        = {2003},
}
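
The components in this toolkit include a secure sum primitive. Below is a minimal Python sketch of the secure-sum idea, simulating the ring of parties in a single process; the modulus and party values are illustrative assumptions, and the sketch is not hardened against colluding parties.

# Secure sum sketch: the initiator masks the running total with a random
# offset R, each party adds its private value modulo m as the message passes
# around the ring, and the initiator removes the mask at the end. No party
# sees another party's raw value, only a uniformly distributed partial sum.
import random

def secure_sum(private_values, modulus=10**6):
    R = random.randrange(modulus)          # known only to the initiator
    message = R
    for v in private_values:               # message travels around the ring
        message = (message + v) % modulus
    return (message - R) % modulus         # initiator unmasks the total

random.seed(7)
print(secure_sum([17, 42, 8]))             # 67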

• B. Pinkas, “Cryptographic Techniques for Privacy Preserving Data Mining,” SIGKDD Explorations, vol. 4, no. 2, pp. 12-19, 2002. [link]

[abstract]

Research in secure distributed computation, which was done as part of a larger body of research in the theory of cryptography, has achieved remarkable results. It was shown that non-trusting parties can jointly compute functions of their different inputs while ensuring that no party learns anything but the defined output of the function. These results were shown using generic constructions that can be applied to any function that has an efficient representation as a circuit. We describe these results, discuss their efficiency, and demonstrate their relevance to privacy preserving computation of data mining algorithms. We also show examples of secure computation of data mining algorithms that use these generic constructions.

[bib]

@ARTICLE{Pinkas_02,
  AUTHOR =       "B. Pinkas",
  TITLE =        "Cryptographic Techniques for Privacy Preserving Data Mining",
  JOURNAL =      "SIGKDD Explorations",
  YEAR =         "2002",
  volume =       "4",
  number =       "2",
  pages =        "12--19 ",
  month =        "",
  note =         "",
  abstract =     "",
  keywords =     "",
  source =       "",
}

• Murat Kantarcioğlu, Jiashun Jin, and Chris Clifton, "When do data mining results violate privacy?," in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004. [link]

[abstract]

Privacy-preserving data mining has concentrated on obtaining valid results when the input data is private. An extreme example is Secure Multiparty Computation-based methods, where only the results are revealed. However, this still leaves a potential privacy breach: Do the results themselves violate privacy? This paper explores this issue, developing a framework under which this question can be addressed. Metrics are proposed, along with analysis that those metrics are consistent in the face of apparent problems.

[bib]

@inproceedings{1014126,
 author = {Murat Kantarcio{\u{g}}lu and Jiashun Jin and Chris Clifton},
 title = {When do data mining results violate privacy?},
 booktitle = {KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining},
 year = {2004},
 isbn = {1-58113-888-1},
 pages = {599--604},
 location = {Seattle, WA, USA},
 doi = {http://doi.acm.org/10.1145/1014052.1014126},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }

• Shipra Agrawal, Vijay Krishnan, Jayant Haritsa, "On Addressing Efficiency Concerns in Privacy Preserving Data Mining," In Proceedings of the 9th International Conference on Database Systems for Advanced Applications (DASFAA-2004). [link]

[abstract]

Data mining services require accurate input data for their results to be meaningful, but privacy concerns may influence users to provide spurious information. To encourage users to provide correct inputs, we recently proposed a data distortion scheme for association rule mining that simultaneously provides both privacy to the user and accuracy in the mining results. However, mining the distorted database can be orders of magnitude more time-consuming as compared to mining the original database. In this paper, we address this issue and demonstrate that by (a) generalizing the distortion process to perform symbol-specific distortion, (b) appropriately choosing the distortion parameters, and (c) applying a variety of optimizations in the reconstruction process, runtime efficiencies that are well within an order of magnitude of undistorted mining can be achieved.

[bib]

''this is the bibtex entry ...''

• Chris Clifton, Murat Kantarcioglu, and Jaideep Vaidya, "Defining Privacy for Data Mining," in Proceedings of the National Science Foundation Workshop on Next Generation Data Mining, November 1-3, 2002, Baltimore, MD. [link]

[abstract]

Privacy preserving data mining – getting valid data mining results without learning the underlying data values – has been receiving attention in the research community and beyond. However, it is unclear what privacy preserving means. This paper provides a framework and metrics for discussing the meaning of privacy preserving data mining, as a foundation for further research in this field.

[bib]

''this is the bibtex entry ...''




Additive Data Perturbation

• R. Agrawal and R. Srikant, “Privacy Preserving Data Mining,” Proc. ACM SIGMOD Conf. Management of Data, pp. 439-450, May 2000. [link]

[abstract]

A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.

[bib]

@INPROCEEDINGS{Agrawal:00,
  AUTHOR =       "R. Agrawal and R. Srikant",
  TITLE =        "Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the ACM SIGMOD Conference on Management of Data",
  YEAR =         "2000",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "439--450",
  address =      "Dallas, TX",
  month =        "May",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}
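
The perturb-then-reconstruct idea in the abstract above can be sketched numerically. The fragment below adds Gaussian noise to private values and then runs the iterative Bayesian re-estimation of the original distribution over a fixed bin grid; the noise level, bin edges, and iteration count are illustrative assumptions rather than the paper's settings.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=5000)               # private values, never released
noise_sd = 20.0
w = x + rng.normal(0, noise_sd, size=x.size)    # released perturbed values

bins = np.linspace(-50, 150, 81)                # discretization of the domain
centers = 0.5 * (bins[:-1] + bins[1:])

def noise_density(d):                           # Gaussian noise pdf (unnormalized)
    return np.exp(-d**2 / (2 * noise_sd**2))

# Iterative Bayesian reconstruction: re-estimate the prior over bins as the
# average posterior of the original value given each perturbed sample.
f_x = np.full(centers.size, 1.0 / centers.size)
for _ in range(100):
    lik = noise_density(w[:, None] - centers[None, :]) * f_x
    f_x = (lik / lik.sum(axis=1, keepdims=True)).mean(axis=0)

# f_x now approximates the histogram of x although only w was observed.
true_hist, _ = np.histogram(x, bins=bins)
print(np.round(f_x[35:45], 3))
print(np.round(true_hist[35:45] / x.size, 3))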

• D. Agrawal and C. C. Aggarwal, "On the Design and Quantification of Privacy Preserving Data Mining Algorithms," Proc. ACM SIGMOD, pp. 247-255, 2001. [link]

[abstract]

The increasing ability to track and collect large amounts of data with the use of current hardware technology has led to an interest in the development of data mining algorithms which preserve user privacy. A recently proposed technique addresses the issue of privacy preservation by perturbing the data and reconstructing distributions at an aggregate level in order to perform the mining. This method is able to retain privacy while accessing the information implicit in the original attributes. The distribution reconstruction process naturally leads to some loss of information which is acceptable in many practical situations. This paper discusses an Expectation Maximization (EM) algorithm for distribution reconstruction which is more effective than the currently available method in terms of the level of information loss. Specifically, we prove that the EM algorithm converges to the maximum likelihood estimate of the original distribution based on the perturbed data. We show that when a large amount of data is available, the EM algorithm provides robust estimates of the original distribution. We propose metrics for quantification and measurement of privacy-preserving data mining algorithms. Thus, this paper provides the foundations for measurement of the effectiveness of privacy preserving data mining algorithms. Our privacy metrics illustrate some interesting results on the relative effectiveness of different perturbing distributions.

[bib]

@INPROCEEDINGS{Agrawal:01,
  AUTHOR =       "D. Agrawal and C. C. Aggarwal",
  TITLE =        "On the Design and Quantification of Privacy
                   Preserving Data Mining Algorithms",
  BOOKTITLE =    "Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of Database Systems",
  YEAR =         "2001",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "247--255",
  address =      "Santa Barbara, CA",
  month =        "",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, “On the Privacy Preserving Properties of Random Data Perturbation Techniques,” Proc. IEEE Int’l Conf. Data Mining, Nov. 2003. [link]

[abstract]

Privacy is becoming an increasingly important issue in many data mining applications. This has triggered the development of many privacy-preserving data mining techniques. A large fraction of them use randomized data distortion techniques to mask the data for preserving the privacy of sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values often using additive noise. This paper questions the utility of the random value distortion technique in privacy preservation. The paper notes that random objects (particularly random matrices) have "predictable" structures in the spectral domain and it develops a random matrix-based spectral filtering technique to retrieve original data from the dataset distorted by adding random values. The paper presents the theoretical foundation of this filtering method and extensive experimental results to demonstrate that in many cases random data distortion preserves very little data privacy. The paper also points out possible avenues for the development of new privacy-preserving data mining techniques like exploiting multiplicative and colored noise for preserving privacy in data mining applications.

[bib]

@INPROCEEDINGS{Kargupta:03a,
    AUTHOR      = {H. Kargupta and S. Datta and Q. Wang and K. Sivakumar},
    TITLE       = "On the Privacy Preserving Properties of Random Data Perturbation Techniques",
    BOOKTITLE   = {{Proceedings of the IEEE International Conference on Data Mining}},
    YEAR        = {2003},
    EDITOR      = {},
    PAGES       = {99},
    VOLUME      = {},
    NUMBER      = {},
    SERIES      = {},
    ADDRESS     = {Melbourne, FL},
    MONTH       = {November},
    NOTE        = {},
    KEY         = {},
    KEYWORDS    = {},
    ISBN        = {},
    URL         = {},
    ABSTRACT    = {},
}
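
The spectral filtering attack described above is short to sketch: estimate the covariance of the perturbed data, keep only the eigencomponents whose eigenvalues exceed what random matrix theory predicts for pure i.i.d. noise, and project the data onto that subspace. The sketch assumes the attacker knows the noise variance, and the dimensions are illustrative; it demonstrates the principle rather than the paper's exact filter.

import numpy as np

rng = np.random.default_rng(1)
n, m = 2000, 10
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, m))   # correlated private data
sigma = 0.5
U = X + rng.normal(0, sigma, size=(n, m))               # released perturbed data

# Eigen-decompose the perturbed covariance; the Marchenko-Pastur edge bounds
# the largest eigenvalue a pure-noise covariance could produce.
evals, evecs = np.linalg.eigh(np.cov(U, rowvar=False))
lam_noise_max = sigma**2 * (1 + np.sqrt(m / n))**2
signal = evecs[:, evals > lam_noise_max]                # data-bearing directions

X_hat = (U - U.mean(0)) @ signal @ signal.T + U.mean(0) # filtered estimate
print("noise sd:", sigma)
print("attack RMS error:", round(float(np.sqrt(np.mean((X_hat - X)**2))), 3))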

• K. Muralidhar and R. Sarathy, "Security of Random Data Perturbation Methods," ACM Transactions on Database Systems, Vol. 24, No. 4, December 1999, Pages 487–493. [link]

[abstract]

Statistical databases often use random data perturbation (RDP) methods to protect against disclosure of confidential numerical attributes. One of the key requirements of RDP methods is that they provide the appropriate level of security against snoopers who attempt to obtain information on confidential attributes through statistical inference. In this study, we evaluate the security provided by three methods of perturbation. The results of this study allow the database administrator to select the most effective RDP method that assures adequate protection against disclosure of confidential information.

[bib]

''this is the bibtex entry ...''

• K. Muralidhar and R. Sarathy, "A theoretical basis for perturbation methods," Statistics and Computing 13: 329–335, 2003. [link]

[abstract]

In this paper we discuss a new theoretical basis for perturbation methods. In developing this new theoretical basis, we define the ideal measures of data utility and disclosure risk. Maximum data utility is achieved when the statistical characteristics of the perturbed data are the same as that of the original data. Disclosure risk is minimized if providing users with microdata access does not result in any additional information. We show that when the perturbed values of the confidential variables are generated as independent realizations from the distribution of the confidential variables conditioned on the non-confidential variables, they satisfy the data utility and disclosure risk requirements. We also discuss the relationship between the theoretical basis and some commonly used methods for generating perturbed values of confidential numerical variables.

[bib]

''this is the bibtex entry ...''

• Alexandre Evfimievski, "Randomization in Privacy Preserving Data Mining," ACM SIGKDD Explorations Newsletter, Volume 4, Issue 2, Pages: 43-48, December 2002. [link]

[abstract]

Suppose there are many clients, each having some personal information, and one server, which is interested only in aggregate, statistically significant, properties of this information. The clients can protect privacy of their data by perturbing it with a randomization algorithm and then submitting the randomized version. The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. How much distortion is needed to protect privacy can be determined using a privacy measure. Several possible privacy measures are known; finding the best measure is an open question. This paper presents some methods and results in randomization for numerical and categorical data, and discusses the issue of measuring privacy.

[bib]

''this is the bibtex entry ...''
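
A concrete instance of the randomization this survey discusses is Warner-style randomized response for a sensitive binary attribute: each respondent reports the true value with probability p and the flipped value otherwise, and the analyst inverts the known randomization at the aggregate level. A minimal sketch, with p and the population chosen purely for illustration:

import random

def randomize(bit, p=0.7):
    # report truthfully with probability p, otherwise flip the bit
    return bit if random.random() < p else 1 - bit

def estimate_true_proportion(reports, p=0.7):
    # E[reported 1] = p*pi + (1-p)*(1-pi); solve for pi
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

random.seed(0)
truth = [1] * 3000 + [0] * 7000                 # true proportion is 0.30
reports = [randomize(b) for b in truth]
print(round(estimate_true_proportion(reports), 3))  # close to 0.30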

• A. Evfimevski, J. Gehrke, and R. Srikant, “Limiting Privacy Breaches in Privacy Preserving Data Mining,” Proc. ACM SIGMOD/PODS Conf., June 2003. [link]

[abstract]

There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average. In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from [9] to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.

[bib]

@INPROCEEDINGS{Evfimievski:03,
  AUTHOR =       "A. Evfimevski and J. Gehrke and R. Srikant",
  TITLE =        "Limiting Privacy Breaches in Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the ACM SIGMOD/PODS Conference",
  YEAR =         "2003",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "211--222",
  address =      "San Diego, CA",
  month =        "June",
  organization = "",
  note =         "",
  abstract =     "",
  keywords =     "",
}
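
The amplification condition of this paper bounds, for every possible output, the ratio between the probabilities of producing it from any two inputs. A tiny sketch that computes this factor for a randomized-response transition matrix; the retention probability p is an assumed example value:

import itertools

def amplification_factor(P):
    # gamma = max over outputs y and input pairs (x1, x2) of P[x1][y] / P[x2][y];
    # the operator is "at most gamma-amplifying" in the paper's terminology.
    gamma = 0.0
    for y in range(len(P[0])):
        for x1, x2 in itertools.permutations(range(len(P)), 2):
            gamma = max(gamma, P[x1][y] / P[x2][y])
    return gamma

p = 0.7                                # keep the true value with probability p
P = [[p, 1 - p],                       # row: true value, column: reported value
     [1 - p, p]]
print(round(amplification_factor(P), 3))   # 0.7 / 0.3 ~= 2.333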

• C. W. Wu, "Privacy Preserving Data Mining: a signal processing perspective and a simple data perturbation protocol," 2nd Workshop on Privacy Preserving Data Mining, IEEE International Conference on Data Mining, Melbourne, FL, 2003. [link]

[abstract]

Privacy concerns over the proliferation of gathering of personal information by various institutions over the internet led to the development of data mining algorithms that preserve the privacy of those whose personal data are collected and analyzed. A novel approach to such privacy preserving data mining algorithms was proposed where the individual datum in a data set is perturbed by adding a random value from a known distribution. In these applications, the distribution of the original data set is important and estimating it is one of the goals of the data mining algorithm. This distribution is estimated via an iterative algorithm such as the Expectation Maximization (EM) algorithm, which was shown to have desirable properties such as low privacy loss and high fidelity estimates of the distribution. Each iteration of EM requires computation that is proportional to the size of the data set and can require large computation time to estimate the distribution. In this paper we propose two ways to reduce the amount of computation. First, we show that the problem can be recast as a deconvolution problem and signal processing algorithms can be applied to solve this problem. In particular we consider both a direct method and iterative methods which are more robust against noise and ill-conditioning. We show that the Richardson-Lucy deblurring algorithm is equivalent to EM after quantization. The signal processing approach also shows how the choice of perturbation affects information loss and privacy loss and allows us to clarify some points made in the literature.

In the second part of this paper, we propose a scheme for perturbing data which also has the nice properties of arbitrarily small privacy loss and arbitrarily high fidelity in the estimate. The main advantage of the proposed scheme is the simplicity of the estimation algorithm. In contrast to iterative algorithms such as EM, the proposed scheme estimates the unknown distribution in one step. This is significant in applications where the data set is very large or when the data mining algorithm is run in an online environment.

[bib]

''this is the bibtex entry ...''

• Z. Huang, W. Du, and B. Chen, "Deriving Private Information from Randomized Data," in Proceedings of the 2005 ACM SIGMOD Conference, pp. 37-48, Baltimore, MD, 2005. [link]

[abstract]

Randomization has emerged as a useful technique for data disguising in privacy-preserving data mining. Its privacy properties have been studied in a number of papers. Kargupta et al. challenged the randomization schemes, and they pointed out that randomization might not be able to preserve privacy. However, it is still unclear what factors cause such a security breach, how they affect the privacy preserving property of the randomization, and what kinds of data have higher risk of disclosing their private contents even though they are randomized.

We believe that the key factor is the correlations among attributes. We propose two data reconstruction methods that are based on data correlations. One method uses the Principal Component Analysis (PCA) technique, and the other method uses the Bayes Estimate (BE) technique. We have conducted theoretical and experimental analysis on the relationship between data correlations and the amount of private information that can be disclosed based on our proposed data reconstruction schemes. Our studies have shown that when the correlations are high, the original data can be reconstructed more accurately, i.e., more private information can be disclosed.

To improve privacy, we propose a modified randomization scheme, in which we make the correlation of the random noise "similar" to that of the original data. Our results have shown that the reconstruction accuracy of both PCA-based and BE-based schemes becomes worse as the similarity increases.

[bib]

@INPROCEEDINGS{Huang_05,
  AUTHOR =       "Z. Huang and W. Du and B. Chen",
  TITLE =        "Deriving Private Information from Randomized Data",
  BOOKTITLE =    "Proceedings of the 2005 ACM SIGMOD Conference",
  YEAR =         "2005",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "37--48",
  address =      "Baltimore, MD",
  month =        "June",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}
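
The modified randomization scheme proposed above gives the additive noise a correlation structure "similar" to the data's, so correlation-based (PCA/BE) reconstruction finds no noise-only directions to remove. A sketch under the assumption of multivariate normal noise whose covariance is proportional to the data covariance:

import numpy as np

rng = np.random.default_rng(2)
n, m = 10000, 5
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))  # correlated private data
C = np.cov(X, rowvar=False)

noise_scale = 0.5                      # noise covariance = 0.5 * C
noise = rng.multivariate_normal(np.zeros(m), noise_scale * C, size=n)
U = X + noise                          # released data

# cov(U) ~= (1 + noise_scale) * C: same eigenvectors, uniformly scaled
# eigenvalues, so a spectral or PCA attack cannot isolate a noise subspace.
ratio = np.linalg.eigvalsh(np.cov(U, rowvar=False)) / np.linalg.eigvalsh(C)
print(np.round(ratio, 2))              # every entry is roughly 1.5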

• S. Guo and X. Wu, "On the Use of Spectral Filtering for Privacy Preserving Data Mining," Proceedings of the 21st ACM Symposium on Applied Computing (SAC06), Dijon, France, April 23-27, pp. 622-626, 2006. [link]

[abstract]

Randomization has been a primary tool to hide sensitive private information during privacy preserving data mining. Previous work based on spectral filtering showed that the noise may be separated from the perturbed data under some conditions, and that as a result privacy can be seriously compromised. In this paper, we explicitly assess the effects of perturbation on the accuracy of the estimated value and give the explicit relation on how the estimation error varies with perturbation. In particular, we derive one upper bound for the Frobenius norm of the reconstruction error. This upper bound may be exploited by attackers to determine how close their estimates are to the original data using the spectral filtering technique, which imposes a serious threat of privacy breaches.

[bib]

@INPROCEEDINGS{Guo_06,
  AUTHOR =       "S. Guo and X. Wu",
  TITLE =        "On the Use of Spectral Filtering for Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the 21st ACM Symposium on Applied Computing (SAC'06)",
  YEAR =         "2006",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "622--626",
  address =      "Dijon, France",
  month =        "April",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• S. Guo, X. Wu, and Y. Li, "On the Lower Bound of Reconstruction Error for Spectral Filtering," Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), Berlin, Germany, September 18-22, 2006. [link]

[abstract]

This is the abstract ...

[bib]

@INPROCEEDINGS{Guo_06b,
  AUTHOR =       "S. Guo and X. Wu and Y. Li",
  TITLE =        "On the Lower Bound of Reconstruction Error for Spectral Filtering",
  BOOKTITLE =    "Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06)",
  YEAR =         "2006",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "Berlin, Germany",
  month =        "September",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}


Multiplicative Data Perturbation

• Kun Liu, Hillol Kargupta, and Jessica Ryan, "Random Projection-Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining," IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 18, no. 1, pp. 92-106, January 2006. [link]

[abstract]

This paper explores the possibility of using multiplicative random projection matrices for privacy preserving distributed data mining. It specifically considers the problem of computing statistical aggregates like the inner product matrix, correlation coefficient matrix, and Euclidean distance matrix from distributed privacy sensitive data possibly owned by multiple parties. This class of problems is directly related to many other data-mining problems such as clustering, principal component analysis, and classification. This paper makes primary contributions on two different grounds. First, it explores Independent Component Analysis as a possible tool for breaching privacy in deterministic multiplicative perturbation-based models such as random orthogonal transformation and random rotation. Then, it proposes an approximate random projection-based technique to improve the level of privacy protection while still preserving certain statistical characteristics of the data. The paper presents extensive theoretical analysis and experimental results. Experiments demonstrate that the proposed technique is effective and can be successfully used for different types of privacy-preserving data mining applications.

[bib]

@ARTICLE{Liu_06a,
    AUTHOR      = {Kun Liu and Hillol Kargupta and Jessica Ryan},
    TITLE       = {{Random Projection-Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining}},
    JOURNAL     = {{IEEE Transactions on Knowledge and Data Engineering (TKDE)}},
    YEAR        = {2006},
    VOLUME      = {18},
    NUMBER      = {1},
    PAGES       = {92--106},
    MONTH       = {January},
    NOTE        = {},
    KEY         = {},
    KEYWORDS    = {},
    ISSN        = {1041-4347},
    URL         = {http://doi.ieeecomputersociety.org/10.1109/TKDE.2006.14},
    ABSTRACT    = {},
}
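
The statistical aggregates in this paper rest on the fact that a Gaussian random projection scaled by 1/sqrt(k) preserves inner products, and hence Euclidean distances, in expectation. A toy verification with illustrative dimensions:

import numpy as np

rng = np.random.default_rng(3)
m, k = 50, 300                       # original and projected dimensions
x = rng.normal(size=m)
y = x + 0.3 * rng.normal(size=m)     # a nearby vector

R = rng.normal(size=(k, m))          # i.i.d. N(0, 1) projection matrix
u = (R @ x) / np.sqrt(k)
v = (R @ y) / np.sqrt(k)

print("true inner product:     ", round(float(x @ y), 2))
print("projected inner product:", round(float(u @ v), 2))  # close on average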

• Kun Liu, Chris Giannella and Hillol Kargupta, "An Attacker's View of Distance Preserving Maps for Privacy Preserving Data Mining," in Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), Berlin, Germany, September, 2006. [link]

[abstract]

We examine the effectiveness of distance preserving transformations in privacy preserving data mining. These techniques are potentially very useful in that some important data mining algorithms can be applied to the transformed data and produce exactly the same results as if applied to the original data e.g. distance-based clustering, k-nearest neighbor classification. However, the issue of how well the original data is hidden has, to our knowledge, not been carefully studied. We take a step in this direction by assuming the role of an attacker armed with two types of prior information regarding the original data. We examine how well the attacker can recover the original data from the transformed data and prior information. Our results offer insight into the vulnerabilities of distance preserving transformations.

[bib]

@INPROCEEDINGS{Liu_06b,
  AUTHOR =       "Kun Liu and Chris Giannella and Hillol Kargupta",
  TITLE =        "An Attacker's View of Distance Preserving Maps for Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the 10th European Conference on Principles and
Practice of Knowledge Discovery in Databases (PKDD'06)",
  YEAR =         "2006",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "Berlin, Germany",
  month =        "September",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• Hillol Kargupta, Kun Liu, and Jessica Ryan, "Privacy Sensitive Distributed Data Mining from Multi-party Data," Proceedings of the first NSF/NIJ Symposium on Intelligence and Security Informatics, (Tucson, AZ), June 2003, pp. 336-342. [link]

[abstract]

Privacy is becoming an increasingly important issue in data mining, particularly in security and counter-terrorism-related applications where the data is often sensitive. This paper considers the problem of mining privacy sensitive distributed multi-party data. It specifically considers the problem of computing statistical aggregates like the correlation matrix from privacy sensitive data where the program for computing the aggregates is not trusted by the owner(s) of the data. It presents a brief overview of a random projection-based technique to compute the correlation matrix from a single third-party data site and also multiple homogeneous sites.

[bib]

@INPROCEEDINGS{Kargupta_03,
  AUTHOR =       "Hillol Kargupta and Kun Liu and Jessica Ryan",
  TITLE =        "Privacy Sensitive Distributed Data Mining from Multi-party Data",
  BOOKTITLE =    "Proceedings of the first NSF/NIJ Symposium on Intelligence and Security Informatics",
  YEAR =         "2003",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "Lecture Notes in Computer Science",
  pages =        "336--342",
  address =      "Tucson, AZ",
  month =        "June",
  organization = "",
  publisher =    "Springer Berlin/Heidelberg",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• J.J. Kim and W.E. Winkler, “Multiplicative Noise for Masking Continuous Data,” Technical Report Statistics #2003-01, Statistical Research Division, US Bureau of the Census, Washington D.C., Apr. 2003. [link]

[abstract]

To protect the identity of the persons or firms on a microdata file, noise is sometimes added to the data before releasing it to the public. There has been conjecture that, rather than adding noise, multiplying noise might better protect confidentiality. Two forms of multiplicative noise are considered. The first approach is to generate random numbers which have mean one and small variance, and to multiply the original data by the noise. The second approach is to take a logarithmic transformation, compute a covariance matrix of the transformed data, generate random numbers which follow mean zero and variance/covariance c times the variance/covariance computed in the previous step, add the noise to the transformed data, and take an antilog of the noise-added data. This paper investigates the statistical properties of both methods and shows how well they protect the identity of those on the file via reidentification trials.

[bib]

@TECHREPORT{Kim_03,
  AUTHOR =       "J. J. Kim and W. E. Winkler",
  TITLE =        "Multiplicative Noise for Masking Continuous Data",
  INSTITUTION =  "Statistical Research Division, U.S. Bureau of the Census",
  YEAR =         "2003",
  type =         "",
  number =       "Statistics \#2003-01",
  address =      "Washington D.C.",
  month =        "April",
  note =         "",
  abstract =     "",
  keywords =     "",
  source =       "",
  file = F
}
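
Both masking schemes from the abstract are straightforward to sketch for a single numeric column; the noise variances and the simulated income column below are illustrative assumptions, not the report's recommendations.

import numpy as np

rng = np.random.default_rng(4)
incomes = rng.lognormal(mean=10, sigma=0.5, size=1000)   # private column

# Scheme 1: multiply by noise with mean one and small variance.
masked1 = incomes * rng.normal(1.0, 0.1, size=incomes.size)

# Scheme 2: log-transform, add noise with variance c times the variance of
# the transformed data, then take the antilog.
log_x = np.log(incomes)
c = 0.01
masked2 = np.exp(log_x + rng.normal(0.0, np.sqrt(c * log_x.var()),
                                    size=incomes.size))

for name, masked in (("scheme 1", masked1), ("scheme 2", masked2)):
    print(name, "mean ratio:", round(float(masked.mean() / incomes.mean()), 3))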

• K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), pages 589–592, Houston, TX, November 2005. [link]

[abstract]

This paper presents a random rotation perturbation approach for privacy preserving data classification. Concretely, we identify the importance of classification-specific information with respect to the loss of information factor, and present a random rotation perturbation framework for privacy preserving data classification. Our approach has two unique characteristics. First, we identify that many classification models utilize the geometric properties of datasets, which can be preserved by geometric rotation. We prove that three types of classifiers will deliver the same performance over the rotation-perturbed dataset as over the original dataset. Second, we propose a multi-column privacy model to address the problems of evaluating privacy quality for multidimensional perturbation. With this metric, we develop a local optimal algorithm to find a good rotation perturbation in terms of privacy guarantee. We also analyze both naive estimation and ICA-based reconstruction attacks with the privacy model. Our initial experiments show that the random rotation approach can provide high privacy guarantee while maintaining zero loss of accuracy for the discussed classifiers.

[bib]

@INPROCEEDINGS{Chen_05,
  AUTHOR =       "K. Chen and L. Liu",
  TITLE =        "Privacy Preserving Data Classification with Rotation Perturbation",
  BOOKTITLE =    "Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05)",
  YEAR =         "2005",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "589--592",
  address =      "Houston, TX",
  month =        "November",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}
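
The geometric property this paper relies on is that an orthogonal transform leaves pairwise Euclidean distances (and inner products) untouched. A minimal sketch that draws a random rotation from the QR decomposition of a Gaussian matrix; the data dimensions are illustrative:

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))                 # records as rows

Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
X_rot = X @ Q                                 # released, perturbed data

d_orig = np.linalg.norm(X[0] - X[1])
d_rot = np.linalg.norm(X_rot[0] - X_rot[1])
print(abs(d_orig - d_rot) < 1e-10)            # True: distances unchanged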

• S. R. M. Oliveira and O. R. Zaiane. Privacy preservation when sharing data for clustering. In Proceedings of the International Workshop on Secure Data Management in a Connected World, pages 67–82, Toronto, Canada, August 2004. [link]

[abstract]

In this paper, we address the problem of protecting the underlying attribute values when sharing data for clustering. The challenge is how to meet privacy requirements and guarantee valid clustering results as well. To achieve this dual goal, we propose a novel spatial data transformation method called Rotation-Based Transformation (RBT). The major features of our data transformation are: a) it is independent of any clustering algorithm; b) it has a sound mathematical foundation; c) it is efficient and accurate; and d) it does not rely on intractability hypotheses from algebra and does not require CPU-intensive operations. We show analytically that although the data are transformed to achieve privacy, we can also get accurate clustering results by the safeguard of the global distances between data points.

[bib]

@INPROCEEDINGS{Oliveira_04p,
  AUTHOR =       "S. R. M. Oliveira and O. R. {Za\"{i}ane}",
  TITLE =        "Privacy Preservation When Sharing Data For Clustering",
  BOOKTITLE =    "Proceedings of the International Workshop on Secure Data Management in a Connected World",
  YEAR =         "2004",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "67--82",
  address =      "Toronto, Canada",
  month =        "August",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}




Categorical or General Data Perturbation

• Alexandre Evfimievski, "Randomization in Privacy Preserving Data Mining," ACM SIGKDD Explorations Newsletter, Volume 4, Issue 2, Pages: 43-48, December 2002. [link]

[abstract]

Suppose there are many clients, each having some personal information, and one server, which is interested only in aggregate, statistically significant, properties of this information. The clients can protect privacy of their data by perturbing it with a randomization algorithm and then submitting the randomized version. The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. How much distortion is needed to protect privacy can be determined using a privacy measure. Several possible privacy measures are known; finding the best measure is an open question. This paper presents some methods and results in randomization for numerical and categorical data, and discusses the issue of measuring privacy.

[bib]

''this is the bibtex entry ...''

• A. Evfimevski, J. Gehrke, and R. Srikant, “Limiting Privacy Breaches in Privacy Preserving Data Mining,” Proc. ACM SIGMOD/PODS Conf., June 2003. [link]

[abstract]

There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average. In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from [9] to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.

[bib]

@INPROCEEDINGS{Evfimievski:03,
  AUTHOR =       "A. Evfimevski and J. Gehrke and R. Srikant",
  TITLE =        "Limiting Privacy Breaches in Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the ACM SIGMOD/PODS Conference",
  YEAR =         "2003",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "211--222",
  address =      "San Diego, CA",
  month =        "June",
  organization = "",
  note =         "",
  abstract =     "",
  keywords =     "",
}


Data Anonymization

• Latanya Sweeney, "k-anonymity: a model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Volume 10, Issue 5, pp. 557-570, 2002. [link]

[abstract]

Consider a data holder, such as a hospital or a bank, that has a privately held collection of person-specific, field structured data. Suppose the data holder wants to share a version of the data with researchers. How can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful? The solution provided in this paper includes a formal protection model named k-anonymity and a set of accompanying policies for deployment. A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release. This paper also examines re-identification attacks that can be realized on releases that adhere to k-anonymity unless accompanying policies are respected. The k-anonymity protection model is important because it forms the basis on which the real-world systems known as Datafly, µ-Argus and k-Similar provide guarantees of privacy protection.

[bib]

@article{774552,
 author = {Latanya Sweeney},
 title = {k-anonymity: a model for protecting privacy},
 journal = {Int. J. Uncertain. Fuzziness Knowl.-Based Syst.},
 volume = {10},
 number = {5},
 year = {2002},
 issn = {0218-4885},
 pages = {557--570},
 doi = {http://dx.doi.org/10.1142/S0218488502001648},
 publisher = {World Scientific Publishing Co., Inc.},
 address = {River Edge, NJ, USA},
 }
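
Sweeney's definition translates directly into a check over the quasi-identifier groups of a table. A sketch on a hypothetical toy table (the column names and values are made up for illustration):

from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    # every combination of quasi-identifier values must occur at least k times
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in rows)
    return all(count >= k for count in groups.values())

table = [
    {"zip": "212**", "age": "2*", "disease": "flu"},
    {"zip": "212**", "age": "2*", "disease": "cold"},
    {"zip": "212**", "age": "3*", "disease": "flu"},
    {"zip": "212**", "age": "3*", "disease": "ulcer"},
]
print(is_k_anonymous(table, ["zip", "age"], k=2))  # True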

• L. Sweeney, "Achieving k-anonymity privacy protection using generalization and suppression," International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), pp. 571-588, 2002. [link]

[abstract]

Often a data holder, such as a hospital or bank, needs to share person-specific records in such a way that the identities of the individuals who are the subjects of the data cannot be determined. One way to achieve this is to have the released records adhere to k-anonymity, which means each released record has at least (k-1) other records in the release whose values are indistinct over those fields that appear in external data. So, k-anonymity provides privacy protection by guaranteeing that each released record will relate to at least k individuals even if the records are directly linked to external information. This paper provides a formal presentation of combining generalization and suppression to achieve k-anonymity. Generalization involves replacing (or recoding) a value with a less specific but semantically consistent value. Suppression involves not releasing a value at all. The Preferred Minimal Generalization Algorithm (MinGen), which is a theoretical algorithm presented herein, combines these techniques to provide k-anonymity protection with minimal distortion. The real-world algorithms Datafly and µ-Argus are compared to MinGen. Both Datafly and µ-Argus use heuristics to make approximations, and so, they do not always yield optimal results. It is shown that Datafly can over-distort data and µ-Argus can additionally fail to provide adequate protection.

[bib]

@article{774553,
 author = {Latanya Sweeney},
 title = {Achieving k-anonymity privacy protection using generalization and suppression},
 journal = {Int. J. Uncertain. Fuzziness Knowl.-Based Syst.},
 volume = {10},
 number = {5},
 year = {2002},
 issn = {0218-4885},
 pages = {571--588},
 doi = {http://dx.doi.org/10.1142/S021848850200165X},
 publisher = {World Scientific Publishing Co., Inc.},
 address = {River Edge, NJ, USA},
 }
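
The paper's two recoding operations are easy to illustrate with hand-rolled helpers; real systems such as Datafly search a generalization hierarchy rather than applying fixed rules like the hypothetical ones below.

def generalize_age(age):
    # recode to a less specific but semantically consistent value: 27 -> "20-29"
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

def generalize_zip(zipcode, keep=3):
    # keep a prefix and mask the rest: "21250" -> "212**"
    return zipcode[:keep] + "*" * (len(zipcode) - keep)

def suppress(_value):
    return "*"                          # do not release the value at all

record = {"age": 27, "zip": "21250", "name": "Alice"}
released = {"age": generalize_age(record["age"]),
            "zip": generalize_zip(record["zip"]),
            "name": suppress(record["name"])}
print(released)   # {'age': '20-29', 'zip': '212**', 'name': '*'}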

• A. Meyerson and R. Williams, "General k-Anonymization is Hard," Technical Report CMU-CS-03-113, Carnegie Mellon University, School of Computer Science, 2003. [link]

[abstract]

This is the abstract ...

[bib]

''this is the bibtex entry ...''

• K. Wang, P. S. Yu, and S. Chakraborty, "Bottom-up generalization: a data mining solution to privacy protection," in Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), November 2004. [link]

[abstract]

Well-known privacy-preserving data mining work modifies existing data mining techniques to operate on randomized data. In this paper, we investigate data mining as a technique for masking data, an approach we therefore term data mining based privacy protection. This approach partially incorporates the requirement of a targeted data mining task into the process of masking data so that essential structure is preserved in the masked data. The idea is simple but novel: we explore the data generalization concept from data mining as a way to hide detailed information, rather than discover trends and patterns. Once the data is masked, standard data mining techniques can be applied without modification. Our work demonstrates another positive use of data mining technology: not only can it discover useful patterns, but it can also mask private information.

We consider the following privacy problem: a data holder wants to release a version of data for building classification models, but wants to protect against linking the released data to an external source for inferring sensitive information. We adapt an iterative bottom-up generalization from data mining to generalize the data. The generalized data remains useful to classification but becomes difficult to link to other sources. The generalization space is specified by a hierarchical structure of generalizations. A key step is identifying the best generalization to climb up the hierarchy at each iteration. Enumerating all candidate generalizations is impractical. We present a scalable solution that examines at most one generalization in each iteration for each attribute involved in the linking.

[bib]

@inproceedings{WYC04icdm,
  author = "K. Wang and P. S. Yu and S. Chakraborty",
  title = "Bottom-up generalization: a data mining solution to privacy protection",
  booktitle = "Proc. of the 4th IEEE International Conference on Data Mining (ICDM 2004)",
  month = "November",
  year = "2004",
}

• Roberto J. Bayardo and Rakesh Agrawal, "Data Privacy Through Optimal k-Anonymization," in Proceedings of the 21st International Conference on Data Engineering (ICDE'05) - Volume 00, pp. 217-228, 2005. [link]

[abstract]

Data de-identification reconciles the demand for release of data for research purposes and the demand for privacy from individuals. This paper proposes and evaluates an optimization algorithm for the powerful de-identification procedure known as k-anonymization. A k-anonymized dataset has the property that each record is indistinguishable from at least k - 1 others. Even simple restrictions of optimized k-anonymity are NP-hard, leading to significant computational challenges. We present a new approach to exploring the space of possible anonymizations that tames the combinatorics of the problem, and develop data-management strategies to reduce reliance on expensive operations such as sorting. Through experiments on real census data, we show the resulting algorithm can find optimal k-anonymizations under two representative cost measures and a wide range of k. We also show that the algorithm can produce good anonymizations in circumstances where the input data or input parameters preclude finding an optimal solution in reasonable time. Finally, we use the algorithm to explore the effects of different coding approaches and problem variations on anonymization quality and performance. To our knowledge, this is the first result demonstrating optimal k-anonymization of a nontrivial dataset under a general model of the problem.

[bib]

@inproceedings{1054048,
 author = {Roberto J. Bayardo and Rakesh Agrawal},
 title = {Data Privacy through Optimal k-Anonymization},
 booktitle = {ICDE '05: Proceedings of the 21st International Conference on Data Engineering (ICDE'05)},
 year = {2005},
 isbn = {0-7695-2285-8},
 pages = {217--228},
 doi = {http://dx.doi.org/10.1109/ICDE.2005.42},
 publisher = {IEEE Computer Society},
 address = {Washington, DC, USA},
 }

• Sheng Zhong and Zhiqiang Yang and Rebecca N. Wright, "Privacy-enhancing k-anonymization of customer data," in Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 139--147, 2005. [link]

[abstract]

In order to protect individuals' privacy, the technique of k-anonymization has been proposed to de-associate sensitive attributes from the corresponding identifiers. In this paper, we provide privacy-enhancing methods for creating k-anonymous tables in a distributed scenario. Specifically, we consider a setting in which there is a set of customers, each of whom has a row of a table, and a miner, who wants to mine the entire table. Our objective is to design protocols that allow the miner to obtain a k-anonymous table representing the customer data, in such a way that does not reveal any extra information that can be used to link sensitive attributes to corresponding identifiers, and without requiring a central authority who has access to all the original data. We give two different formulations of this problem, with provably private solutions. Our solutions enhance the privacy of k-anonymization in the distributed scenario by maintaining end-to-end privacy from the original customer data to the final k-anonymous results.

[bib]

@inproceedings{1065185,
 author = {Sheng Zhong and Zhiqiang Yang and Rebecca N. Wright},
 title = {Privacy-enhancing k-anonymization of customer data},
 booktitle = {PODS '05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems},
 year = {2005},
 isbn = {1-59593-062-0},
 pages = {139--147},
 location = {Baltimore, Maryland},
 doi = {http://doi.acm.org/10.1145/1065167.1065185},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }

• B. C. M. Fung, K. Wang, and P. S. Yu, "Top-down specialization for information and privacy preservation," in Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005), Tokyo, Japan, April 2005, pp. 205-216. [link]

[abstract]

Releasing person-specific data in its most specific state poses a threat to individual privacy. This paper presents a practical and efficient algorithm for determining a generalized version of data that masks sensitive information and remains useful for modelling classification. The generalization of data is implemented by specializing or detailing the level of information in a top-down manner until a minimum privacy requirement is violated. This top-down specialization is natural and efficient for handling both categorical and continuous attributes. Our approach exploits the fact that data usually contains redundant structures for classification. While generalization may eliminate some structures, other structures emerge to help. Our results show that quality of classification can be preserved even for highly restrictive privacy requirements. This work has great applicability to both public and private sectors that share information for mutual benefits and productivity.

[bib]

@inproceedings{FWY05icde,
  author = "B. C. M. Fung and K. Wang and P. S. Yu",
  title = "Top-Down Specialization for Information and Privacy Preservation",
  booktitle = "Proc. of the 21st IEEE International Conference on Data Engineering (ICDE 2005)",
  pages = "205-216",
  address = "Tokyo, Japan",
  month = "April",
  year = "2005",
}

• K. Wang, B. C. M. Fung, and G. Dong, "Integrating private databases for data analysis," in Proceedings of the 2005 IEEE International Conference on Intelligence and Security Informatics (ISI 2005), Atlanta, GA, May 2005, pp. 171-182. [link]

[abstract]

In today’s globally networked society, there is a dual demand on both information sharing and information protection. A typical scenario is that two parties wish to integrate their private databases to achieve a common goal beneficial to both, provided that their privacy requirements are satisfied. In this paper, we consider the goal of building a classifier over the integrated data while satisfying the k-anonymity privacy requirement. The k-anonymity requirement states that domain values are generalized so that each value of some specified attributes identifies at least k records. The generalization process must not leak more specific information other than the final integrated data. We present a practical and efficient solution to this problem.

[bib]

@inproceedings{WFD05isi,
  author = "K. Wang and B. C. M. Fung and G. Dong",
  title = "Integrating Private Databases for Data Analysis",
  booktitle = "Proc. of the 2005 IEEE International Conference on Intelligence and Security Informatics (ISI 2005)",
  series = "Lecture Notes in Computer Science",
  volume = "3495",
  pages = "171-182",
  address = "Atlanta, GA",
  month = "May",
  year = "2005",
}

• K. Wang, B. C. M. Fung, and P. S. Yu, "Template-based privacy preservation in classification problems," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), Houston, TX, November 2005, pp. 466-473. [link]

[abstract]

In this paper, we present a template-based privacy preservation approach to protect against the threats caused by data mining abilities. The problem has dual goals: preserve the information for a wanted classification analysis and limit the usefulness of unwanted sensitive inferences that may be derived from the data. Sensitive inferences are specified by a set of "privacy templates". Each template specifies the sensitive information to be protected, a set of identifying attributes, and the maximum association between the two. We show that suppressing the domain values is an effective way to eliminate sensitive inferences. For a large data set, finding an optimal suppression is hard, since it requires optimization over all suppressions. We present an approximate but scalable solution. We demonstrate the effectiveness of this approach on real life data sets.

[bib]

@inproceedings{WFY05icdm,
  author = "K. Wang and B. C. M. Fung and P. S. Yu",
  title = "Template-Based Privacy Preservation in Classification Problems",
  booktitle = "Proc. of the 5th IEEE International Conference on Data Mining (ICDM 2005)",
  address = "Houston, TX",
  pages = "466-473",
  month = "November",
  year = "2005",
}

• M. Atzori, F. Bonchi, F. Giannotti, and D. Pedreschi, "k-Anonymous Patterns," in Proceedings of the Ninth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'05), Lecture Notes in Computer Science, Volume 3721, Springer, October 3-7, 2005, Porto, Portugal. [link]

[abstract]

It is generally believed that data mining results do not violate the anonymity of the individuals recorded in the source database. In fact, data mining models and patterns, in order to ensure a required statistical significance, represent a large number of individuals and thus conceal individual identities: this is the case of the minimum support threshold in association rule mining. In this paper we show that this belief is ill-founded. By shifting the concept of k-anonymity from data to patterns, we formally characterize the notion of a threat to anonymity in the context of pattern discovery, and provide a methodology to efficiently and effectively identify all possible such threats that might arise from the disclosure of a set of extracted patterns.

[bib]

@INPROCEEDINGS{Atzori_05a,
  AUTHOR =       "M. Atzori, F. Bonchi, F. Giannotti and D. Pedreschi",
  TITLE =        "k-Anonymous Patterns",
  BOOKTITLE =    "Proceedings of the Ninth European Conference on Principles
                  and Practice of Knowledge Discovery in Databases (PKDD'05)",
  YEAR =         "2005",
  editor =       "",
  volume =       "3721",
  number =       "",
  series =       "Lecture Notes in Computer Science, Springer",
  pages =        "",
  address =      "Porto, Portugal",
  month =        "October",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}
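
The inference channel the paper formalizes can be reproduced in a few lines. Below is a toy detector for the simplest two-itemset case, with made-up supports; the paper handles general patterns. The arithmetic is just supp({a, not-b}) = supp({a}) - supp({a, b}): if that difference falls strictly between 0 and k, the released supports pinpoint a group of fewer than k individuals.

def anonymity_threats(supports, k):
    """supports: dict mapping frozenset itemsets to absolute support."""
    threats = []
    for ab in (s for s in supports if len(s) == 2):
        for a in ab:
            diff = supports[frozenset([a])] - supports[ab]
            if 0 < diff < k:
                threats.append((a, set(ab - {a}), diff))
    return threats

supports = {frozenset(["a"]): 1000, frozenset(["b"]): 997,
            frozenset(["a", "b"]): 997}
print(anonymity_threats(supports, k=10))
# [('a', {'b'}, 3)] -> only 3 individuals have a without b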

• M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi, "Blocking Anonymity Threats Raised by Frequent Itemset Mining," In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05), IEEE. November 27-30, 2005, New Orleans, Louisiana. [link]

[abstract]

In this paper we study when the disclosure of data min- ing results represents, per se, a threat to the anonymity of the individuals recorded in the analyzed database. The novelty of our approach is that we focus on an objective definition of privacy compliance of patterns without any reference to a preconceived knowledge of what is sensi- tive and what is not, on the basis of the rather intuitive and realistic constraint that the anonymity of individu- als should be guaranteed. In particular, the problem ad- dressed here arises from the possibility of inferring from the output of frequent itemset mining (i.e., a set of item- sets with support larger than a threshold \sigma), the exis- tence of patterns with very low support (smaller than an anonymity threshold k). In the following we develop a simple methodology to block such inference opportuni- ties by introducing distortion on the dangerous patterns.

[bib]

@INPROCEEDINGS{Atzori_05b,
  AUTHOR =       "M. Atzori, F. Bonchi, F. Giannotti and D. Pedreschi",
  TITLE =        "Blocking Anonymity Threats Raised by Frequent Itemset Mining",
  BOOKTITLE =    "Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05)",
  YEAR =         "2005",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "",
  month =        "November",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, Muthuramakrishnan Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity," p. 24, 22nd International Conference on Data Engineering (ICDE'06), 2006. [link]

[abstract]

Publishing data about individuals without revealing sensitive information about them is an important problem. In recent years, a new definition of privacy called k-anonymity has gained popularity. In a k-anonymized dataset, each record is indistinguishable from at least k−1 other records with respect to certain “identifying” attributes. In this paper we show with two simple attacks that a k-anonymized dataset has some subtle, but severe privacy problems. First, we show that an attacker can discover the values of sensitive attributes when there is little diversity in those sensitive attributes. Second, attackers often have background knowledge, and we show that k-anonymity does not guarantee privacy against attackers using background knowledge. We give a detailed analysis of these two attacks and we propose a novel and powerful privacy definition called ℓ-diversity. In addition to building a formal foundation for ℓ-diversity, we show in an experimental evaluation that ℓ-diversity is practical and can be implemented efficiently.

[bib]

@inproceedings{Machanavajjhala_06,
  author    = {Ashwin Machanavajjhala and
               Johannes Gehrke and
               Daniel Kifer and
               Muthuramakrishnan Venkitasubramaniam},
  title     = {l-Diversity: Privacy Beyond k-Anonymity.},
  booktitle = {Proceedings of the 22nd International Conference on Data
               Engineering (ICDE'06)},
  year      = {2006},
  pages     = {24},
  ee        = {http://doi.ieeecomputersociety.org/10.1109/ICDE.2006.1},
  crossref  = {DBLP:conf/icde/2006},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
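
For intuition, here is a minimal check of the simplest (distinct) variant of the definition, on made-up records; the paper's entropy and recursive (c,l)-diversity instantiations are stronger.

from collections import defaultdict

def is_l_diverse(records, quasi_attrs, sens_attr, l):
    """Every quasi-identifier class must hold >= l distinct sensitive values."""
    classes = defaultdict(set)
    for r in records:
        classes[tuple(r[a] for a in quasi_attrs)].add(r[sens_attr])
    return all(len(vals) >= l for vals in classes.values())

records = [
    {"zip": "130**", "age": "<30", "disease": "heart"},
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "148**", "age": ">40", "disease": "cancer"},
    {"zip": "148**", "age": ">40", "disease": "cancer"},
]
print(is_l_diverse(records, ["zip", "age"], "disease", l=2))
# False: the 148** class is 2-anonymous but has no diversity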

• K. Wang, B. C. M. Fung, and P. S. Yu, "Handicapping attacker's confidence: an alternative to k-anonymization," in Knowledge and Information Systems: An International Journal (KAIS), 2006. [link]

[abstract]

We present an approach of limiting the confidence of inferring sensitive properties to protect against the threats caused by data mining abilities. The problem has dual goals: preserve the information for a wanted data analysis request and limit the usefulness of unwanted sensitive inferences that may be derived from the release of data. Sensitive inferences are specified by a set of "privacy templates". Each template specifies the sensitive property to be protected, the attributes identifying a group of individuals, and a maximum threshold for the confidence of inferring the sensitive property given the identifying attributes. We show that suppressing the domain values monotonically decreases the maximum confidence of such sensitive inferences. Hence, we propose a data transformation that minimally suppresses the domain values in the data to satisfy the set of privacy templates. The transformed data is free of sensitive inferences even in the presence of data mining algorithms. The prior k-anonymization focuses on personal identities. This work focuses on the association between personal identities and sensitive properties.

[bib]

@article{WFY06kais,
  author = "K. Wang and B. C. M. Fung and P. S. Yu",
  title = "Handicapping Attacker's Confidence: An Alternative to k-Anonymization",
  journal = "Knowledge and Information Systems: An International Journal (KAIS)",
  publisher = "Springer-Verlag",
  year = "2006",
}


Data Swapping

• S.E. Fienberg and J. McIntyre, “Data Swapping: Variations on a Theme by Dalenius and Reiss,” technical report, Nat’l Inst. of Statistical Sciences, Research Triangle Park, NC, 2003. [link]

[abstract]

Data swapping, a term introduced in 1978 by Dalenius and Reiss for a new method of statistical disclosure protection in confidential data bases, has taken on new meanings and been linked to new statistical methodologies over the intervening twenty-five years. This paper revisits the original (1982) published version of the Dalenius-Reiss data swapping paper and then traces the developments of statistical disclosure limitation methods that can be thought of as rooted in the original concept. The emphasis here, as in the original contribution, is on both disclosure protection and the release of statistically usable data bases.

[bib]

@TECHREPORT{Fienberg_03,
  AUTHOR =       "S. E. Fienberg and J. McIntyre",
  TITLE =        "Data Swapping: Variations on a Theme by Dalenius and Reiss",
  INSTITUTION =  "National Institute of Statistical Sciences",
  YEAR =         "2003",
  type =         "",
  number =       "",
  address =      "Research Triangle Park, NC",
  month =        "",
  note =         "",
  abstract =     "",
  keywords =     "",
  source =       "",
  file = F
}
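
A bare-bones sketch of the basic operation the survey revisits: unconstrained random swapping of one attribute between randomly paired records. Marginal distributions are untouched while the record-to-value link is broken for swapped pairs; the methods surveyed in the paper add constraints, such as swapping only within ranks, to limit the damage to multivariate statistics. Data and parameters below are made up.

import random

def swap_attribute(records, attr, swap_rate, seed=0):
    """Swap `attr` between randomly chosen pairs of records."""
    rng = random.Random(seed)
    idx = list(range(len(records)))
    rng.shuffle(idx)
    n_pairs = int(swap_rate * len(records) / 2)
    out = [dict(r) for r in records]
    for i in range(n_pairs):
        a, b = idx[2 * i], idx[2 * i + 1]
        out[a][attr], out[b][attr] = out[b][attr], out[a][attr]
    return out

people = [{"name": n, "salary": s} for n, s in
          [("ann", 50), ("bob", 60), ("carl", 70), ("dana", 80)]]
print(swap_attribute(people, "salary", swap_rate=0.5))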

• Shanti Gomatam, Alan F. Karr and Ashish P. Sanil, "Data Swapping as a Decision Problem," 2004. [link]

[abstract]

We construct a decision-theoretic formulation of data swapping in which quantitative measures of disclosure risk and data utility are employed to select one release from a possibly large set of candidates. The decision variables are the swap rate, swap attribute(s) and possibly, constraints on the unswapped attributes. Risk–utility frontiers, consisting of those candidates not dominated in (risk, utility) space by any other candidate, are a principal tool for reducing the scale of the decision problem. Multiple measures of disclosure risk and data utility, including utility measures based directly on use of the swapped data for statistical inference, are introduced. Their behavior and resulting insights into the decision problem are illustrated using data from the Current Population Survey, the well-studied “Czech auto worker data” and data on schools and administrators generated by the National Center for Education Statistics.

[bib]

''this is the bibtex entry ...''



Randomized Response

• Andris Ambainis and Markus Jakobsson and Helger Lipmaa, "Cryptographic Randomized Response Techniques," Cryptology ePrint Archive: Report 2003/027. [link]

[abstract]

We develop cryptographically secure techniques to guarantee unconditional privacy for respondents to polls. Our constructions are efficient and practical, and are shown not to allow cheating respondents to affect the "tally" by more than their own vote - which will be given the exact same weight as that of other respondents. We demonstrate solutions to this problem based on both traditional cryptographic techniques and quantum cryptography.

[bib]

@misc{cryptoeprint:2003:027,
    author = {Andris Ambainis and Markus Jakobsson and Helger Lipmaa},
    title = {Cryptographic Randomized Response Techniques},
    howpublished = {Cryptology ePrint Archive, Report 2003/027},
    year = {2003},
    note = {\url{http://eprint.iacr.org/}},
}

• Wenliang Du and Zhijun Zhan, "Using Randomized Response Techniques for Privacy-Preserving Data Mining," SIGKDD 2003. [link]

[abstract]

Privacy is an important issue in data mining and knowledge discovery. In this paper, we propose to use the randomized response techniques to conduct the data mining computation. Specifically, we present a method to build decision tree classifiers from the disguised data. We conduct experiments to compare the accuracy of our decision tree with the one built from the original undisguised data. Our results show that although the data are disguised, our method can still achieve fairly high accuracy. We also show how the parameter used in the randomized response techniques affects the accuracy of the results.

[bib]

''this is the bibtex entry ...''
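
The underlying primitive is Warner's randomized response, which is easy to sketch; the paper's contribution is building decision trees on top of such disguised data. Each respondent reports the true bit with probability theta and the flipped bit otherwise, so the observed proportion is lambda = theta*pi + (1-theta)*(1-pi), and the miner can unbias the aggregate as pi = (lambda + theta - 1) / (2*theta - 1). Toy Python with made-up data:

import random

def disguise(bits, theta, seed=0):
    """Each respondent reports the truth with probability theta."""
    rng = random.Random(seed)
    return [b if rng.random() < theta else 1 - b for b in bits]

def estimate_true_proportion(disguised, theta):
    lam = sum(disguised) / len(disguised)
    return (lam + theta - 1) / (2 * theta - 1)

true_bits = [1] * 300 + [0] * 700            # true proportion: 0.30
reported = disguise(true_bits, theta=0.8)    # all the miner ever sees
print(estimate_true_proportion(reported, theta=0.8))   # close to 0.30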



Cryptographic/Secure Multi-Party Computation (SMC)

• B. Pinkas, “Cryptographic Techniques for Privacy Preserving Data Mining,” SIGKDD Explorations, vol. 4, no. 2, pp. 12-19, 2002. [link]

[abstract]

Research in secure distributed computation, which was done as part of a larger body of research in the theory of cryptography, has achieved remarkable results. It was shown that non-trusting parties can jointly compute functions of their different inputs while ensuring that no party learns anything but the defined output of the function. These results were shown using generic constructions that can be applied to any function that has an efficient representation as a circuit. We describe these results, discuss their efficiency, and demonstrate their relevance to privacy preserving computation of data mining algorithms. We also show examples of secure computation of data mining algorithms that use these generic constructions.

[bib]

@ARTICLE{Pinkas_02,
  AUTHOR =       "B. Pinkas",
  TITLE =        "Cryptographic Techniques for Privacy Preserving Data Mining",
  JOURNAL =      "SIGKDD Explorations",
  YEAR =         "2002",
  volume =       "4",
  number =       "2",
  pages =        "12--19 ",
  month =        "",
  note =         "",
  abstract =     "",
  keywords =     "",
  source =       "",
}

• M.J. Atallah and W. Du, “Secure Multi-Party Computational Geometry,” Proc. WADS2001: Seventh Int’l Workshop on Algorithms and Data Structures, pp. 165-179, Aug. 2001. [link]

[abstract]

The general secure multi-party computation problem is when multiple parties (say, Alice and Bob) each have private data (respectively, a and b) and seek to compute some function f(a, b) without revealing to each other anything unintended (i.e., anything other than what can be inferred from knowing f(a, b)). It is well known that, in theory, the general secure multi-party computation problem is solvable using circuit evaluation protocols. While this approach is appealing in its generality, the communication complexity of the resulting protocols depends on the size of the circuit that expresses the functionality to be computed. As Goldreich has recently pointed out [6], using the solutions derived from these general results to solve specific problems can be impractical; problem-specific solutions should be developed, for efficiency reasons. This paper is a first step in this direction for the area of computational geometry. We give simple solutions to some specific geometric problems, and in doing so we develop some building blocks that we believe will be useful in the solution of other geometric and combinatorial problems as well.

[bib]

@INPROCEEDINGS{Atallah_01,
  AUTHOR =       "M. J. Atallah and W. Du",
  TITLE =        "Secure Multi-Party Computational Geometry",
  BOOKTITLE =    "WADS2001: Seventh International Workshop on Algorithms and Data Structures",
  YEAR =         "2001",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "165--179",
  address =      "Providence, Rhode Island",
  month =        "August",
  organization = "",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• W. Du and M. J. Atallah, "Secure MultiParty Computation Problems and Their Applications: A Review and Open Problems," in Proceedings of the 2001 Workshop on New Security Paradigms, pp. 13-22, 2001. [link]

[abstract]

The growth of the Internet has triggered tremendous opportunities for cooperative computation, where people are jointly conducting computation tasks based on the private inputs they each supply. These computations could occur between mutually untrusted parties, or even between competitors. For example, customers might send to a remote database queries that contain private information; two competing financial organizations might jointly invest in a project that must satisfy both organizations' private and valuable constraints, and so on. Today, to conduct such computations, one entity must usually know the inputs from all the participants; however if nobody can be trusted enough to know all the inputs, privacy will become a primary concern.

This problem is referred to as Secure Multi-party Computation Problem (SMC) in the literature. Research in the SMC area has been focusing on only a limited set of specific SMC problems, while privacy concerned cooperative computations call for SMC studies in a variety of computation domains. Before we can study the problems, we need to identify and define the specific SMC problems for those computation domains. We have developed a framework to facilitate this problem-discovery task. Based on our framework, we have identified and defined a number of new SMC problems for a spectrum of computation domains. Those problems include privacy-preserving database query, privacy-preserving scientific computations, privacy-preserving intrusion detection, privacy-preserving statistical analysis, privacy-preserving geometric computations, and privacy-preserving data mining.

The goal of this paper is not only to present our results, but also to serve as a guideline so other people can identify useful SMC problems in their own computation domains.

[bib]

@inproceedings{Du:01a,
    author      = {W. Du and M. J. Atallah},
    title       = {Secure Multi-Party Computation Problems and Their
                   Applications: A Review and Open Problems},
    booktitle   = {Proceedings of the 2001 Workshop on New Security Paradigms},
    pages       = {13-22},
    month       = {September},
    address     = {Cloudcroft, NM},
    year        = 2001,
    publisher   = "ACM Press"
}

• O. Goldreich, The Foundations of Cryptography, vol. 2, chapter 7. Cambridge Univ. Press, 2004. [link]

[abstract]

The current volume is the second part (or volume) of the two-volume work Foundations of Cryptography. The first part (i.e., Volume 1) consists of an introductory chapter (Chapter 1), followed by chapters on computational difficulty (one-way functions), pseudorandomness and zero-knowledge proofs (Chapters 2-4, respectively). The current volume consists of three chapters, which treat encryption schemes (Chapter 5), signature schemes (Chapter 6), and general cryptographic protocols (Chapter 7), respectively. Also included is an appendix that provides corrections and additions to Volume 1.

[bib]

@INBOOK{Goldreich_04,
  AUTHOR =       "O. Goldreich",
  editor =       "",
  TITLE =        "The Foundations of Cryptography",
  CHAPTER =      "7",
  pages =        "",
  PUBLISHER =    "Cambridge University Press",
  YEAR =         "2004",
  volume =       "2",
  number =       "",
  series =       "",
  type =         "",
  address =      "",
  edition =      "",
  month =        "",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• Joan Feigenbaum and Yuval Ishai and Tal Malkin and Kobbi Nissim and Martin J. Strauss and Rebecca N. Wright, "Secure Multiparty Computation of Approximations," [link]

[abstract]

Approximation algorithms can sometimes provide efficient solutions when no efficient exact computation is known. In particular, approximations are often useful in a distributed setting where the inputs are held by different parties and are extremely large. Furthermore, for some applications, the parties want to cooperate to compute a function of their inputs without revealing more information than necessary. If \hat{f} is an approximation to f, secure multiparty computation of \hat{f} allows the parties to compute \hat{f} without revealing unnecessary information. However, secure computation of \hat{f} may not be as private as secure computation of f, because the output of \hat{f} may itself reveal more information than the output of f. In this paper, we present definitions of secure multiparty approximate computations that retain the privacy of a secure computation of f. We present an efficient, sublinear-communication, private approximate computation for the Hamming distance and an efficient private approximation of the permanent.

[bib]

''this is the bibtex entry ...''

• Wenliang Du, "A Study Of Several Specific Secure Two-Party Computation Problems," Ph.D. thesis, 2001. [link]

[abstract]

Alice has a private input $x$ (of any data type, such as a number, a matrix or a data set). Bob has another private input $y$. Alice and Bob want to cooperatively conduct a specific computation on $x$ and $y$ without disclosing to the other person any information about her or his private input except for what could be derived from the results. This problem is a Secure Two-party Computation (STC) problem, which has been extensively studied in the past. Several generic solutions have been proposed to solve the general STC problem; however the generic solutions are often too inefficient to be practical. Therefore, in this dissertation, we study several specific STC problems with the goal of finding more efficient solutions than the generic ones. We introduce a number of specific STC problems in the domains of scientific computation, statistical analysis, computational geometry and database query. Most of the problems have not been studied before in the literature. To solve these problems: (1) We investigate how data perturbation could be used to hide data. Data perturbation hides a datum by adding to it a random number. We show that this technique is effective in preserving privacy. (2) We explore how domain specific knowledge can improve the efficiency of the solutions that we develop over the generic solutions that do not consider domain specific knowledge. We show that such knowledge is important in both hiding data and achieving higher efficiency. (3) We also introduce a number of common building blocks that are useful in solving secure two-party computation problems in various computation domains.

[bib]

@PhdThesis{DuThesis2001,
  author =      {W. Du},
  title =       {A Study of Several Specific Secure Two-party Computation Problems},
  school =      {Purdue University},
  address =     {West Lafayette, Indiana},
  year =        2001
}

• Wenliang Du, Zhijun Zhan, "A Practical Approach to Solve Secure Multi-party Computation Problems," in NSPW '02: Proceedings of the 2002 workshop on New security paradigms, pp. 127-135, 2002. [link]

[abstract]

Secure Multi-party Computation (SMC) problems deal with the following situation: Two (or many) parties want to jointly perform a computation. Each party needs to contribute its private input to this computation, but no party should disclose its private inputs to the other parties, or to any third party. With the proliferation of the Internet, SMC problems become more and more important. So far no practical solution has emerged, largely because SMC studies have been focusing on zero information disclosure, an ideal security model that is expensive to achieve.

Aiming at developing practical solutions to SMC problems, we propose a new paradigm, in which we use an acceptable security model that allows partial information disclosure. Our conjecture is that by lowering the restriction on the security, we can achieve a much better performance. The paradigm is motivated by the observation that in practice people do accept a less secure but much more efficient solution, because sometimes disclosing information about their private data to a certain degree is a risk that many people would rather take if the performance gain is significant. Moreover, in our paradigm, the security is adjustable, such that users can adjust the level of security based on their definition of the acceptable security. We have developed a number of techniques under this new paradigm, and are currently conducting extensive studies based on this new paradigm.

[bib]

@inproceedings{844125,
 author = {Wenliang Du and Zhijun Zhan},
 title = {A practical approach to solve Secure Multi-party Computation problems},
 booktitle = {NSPW '02: Proceedings of the 2002 workshop on New security paradigms},
 year = {2002},
 isbn = {1-58113-598-X},
 pages = {127--135},
 location = {Virginia Beach, Virginia},
 doi = {http://doi.acm.org/10.1145/844102.844125},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }

• Andris Ambainis and Markus Jakobsson and Helger Lipmaa, "Cryptographic Randomized Response Techniques," Cryptology ePrint Archive: Report 2003/027. [link]

[abstract]

We develop cryptographically secure techniques to guarantee unconditional privacy for respondents to polls. Our constructions are efficient and practical, and are shown not to allow cheating respondents to affect the "tally" by more than their own vote - which will be given the exact same weight as that of other respondents. We demonstrate solutions to this problem based on both traditional cryptographic techniques and quantum cryptography.

[bib]

@misc{cryptoeprint:2003:027,
    author = {Andris Ambainis and Markus Jakobsson and Helger Lipmaa},
    title = {Cryptographic Randomized Response Techniques},
    howpublished = {Cryptology ePrint Archive, Report 2003/027},
    year = {2003},
    note = {\url{http://eprint.iacr.org/}},
}

• Bart Goethals, Sven Laur, Helger Lipmaa and Taneli Mielikäinen. On Private Scalar Product Computation for Privacy-Preserving Data Mining. The 7th Annual International Conference in Information Security and Cryptology (ICISC 2004), volume 3506 of Lecture Notes in Computer Science, pages 104--120, 2004. [link]

[abstract]

In mining and integrating data from multiple sources, there are many privacy and security issues. In several different contexts, the security of the full privacy-preserving data mining protocol depends on the security of the underlying private scalar product protocol. We show that two of the private scalar product protocols, one of which was proposed in a leading data mining conference, are insecure. We then describe a provably private scalar product protocol that is based on homomorphic encryption and improve its efficiency so that it can also be used on massive datasets.

[bib]

''this is the bibtex entry ...''
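
A minimal sketch of the kind of protocol the paper advocates: an additively homomorphic scalar product. Alice encrypts her vector under Paillier; Bob computes Enc(x.y) as a product of ciphertext powers without ever seeing x; Alice decrypts the result. The primes below are toy-sized, chosen only for readability, and utterly insecure; the paper's protocol adds the details needed for real security and efficiency.

import math, random

p, q = 2003, 2011                  # toy primes, NOT secure
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
g = n + 1

def L(u): return (u - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)

def enc(m, rng=random.Random(1)):
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

x = [3, 1, 4]                      # Alice's private vector
y = [2, 7, 1]                      # Bob's private vector
cx = [enc(xi) for xi in x]         # Alice sends encryptions to Bob
c = 1
for ci, yi in zip(cx, y):          # Bob: homomorphic dot product
    c = (c * pow(ci, yi, n2)) % n2
print(dec(c))                      # 3*2 + 1*7 + 4*1 = 17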




Privacy Preserving Classification

• R. Agrawal and R. Srikant, “Privacy Preserving Data Mining,” Proc. ACM SIGMOD Conf. Management of Data, pp. 439-450, May 2000. [link]

[abstract]

A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.

[bib]

@INPROCEEDINGS{Agrawal:00,
  AUTHOR =       "R. Agrawal and R. Srikant",
  TITLE =        "Privacy Preserving Data Mining",
  BOOKTITLE =    "Proceedings of the ACM SIGMOD Conference on Management of Data",
  YEAR =         "2000",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "439--450",
  address =      "Dallas, TX",
  month =        "May",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}
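
The two halves of the approach, value perturbation and distribution reconstruction, can be sketched in a few lines of numpy. This is a simplified, discretized version of the paper's iterative Bayes procedure, run on made-up data: users add random noise to their values, and the miner recovers only the distribution of original values, never the values themselves.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(30, 5, 5000)                 # true ages (never revealed)
noise_std = 10.0
w = x + rng.normal(0, noise_std, x.size)    # perturbed values (released)

bins = np.linspace(0, 60, 31)
mids = (bins[:-1] + bins[1:]) / 2
f = np.full(mids.size, 1.0 / mids.size)     # uniform initial guess

def noise_pdf(d):                           # density of the added noise
    return np.exp(-d**2 / (2 * noise_std**2))

for _ in range(100):                        # iterative Bayes update
    # likelihood that sample j arose from bin i: P(noise = w_j - mid_i)
    lik = noise_pdf(w[:, None] - mids[None, :])
    post = lik * f
    post /= post.sum(axis=1, keepdims=True)
    f = post.mean(axis=0)

print(mids[np.argmax(f)])    # the reconstructed distribution peaks near 30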

• Y. Lindell and B. Pinkas, "Privacy Preserving Data Mining," Advances in Cryptology (CRYPTO'00), pp. 36--53, 2000. [link]

[abstract]

In this paper we address the issue of privacy preserving data mining. Specifically, we consider a scenario in which two parties owning confidential databases wish to run a data mining algorithm on the union of their databases, without revealing any unnecessary information. Our work is motivated by the need to both protect privileged information and enable its use for research or other purposes. The above problem is a specific example of secure multi-party computation and as such, can be solved using known generic protocols. However, data mining algorithms are typically complex and, furthermore, the input usually consists of massive data sets. The generic protocols in such a case are of no practical use and therefore more efficient protocols are required. We focus on the problem of decision tree learning with the popular ID3 algorithm. Our protocol is considerably more efficient than generic solutions and demands both very few rounds of communication and reasonable bandwidth.

[bib]

@INPROCEEDINGS{Lindell:00,
  AUTHOR =       "Y. Lindell and B. Pinkas",
  TITLE =        "Privacy Preserving Data Mining",
  BOOKTITLE =    "Advances in Cryptology (CRYPTO'00)",
  YEAR =         "2000",
  editor =       "",
  volume =       "1880",
  number =       "",
  series =       "Lecture Notes in Computer Science",
  pages =        "36--53",
  address =      "",
  month =        "",
  organization = "",
  publisher =    "Springer-Verlag",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• W. Du and Z. Zhan, “Building Decision Tree Classifier on Private Data,” Proc. IEEE Int’l Conf. Privacy, Security, and Data Mining, pp. 1-8, Dec. 2002. [link]

[abstract]

This paper studies how to build a decision tree classifier under the following scenario: a database is vertically partitioned into two pieces, with one piece owned by Alice and the other piece owned by Bob. Alice and Bob want to build a decision tree classifier based on such a database, but due to the privacy constraints, neither of them wants to disclose their private pieces to the other party or to any third party. We present a protocol that allows Alice and Bob to conduct such a classifier building without having to compromise their privacy. Our protocol uses an untrusted third-party server, and is built upon a useful building block, the scalar product protocol. Our solution to the scalar product protocol is more efficient than any existing solutions.

[bib]

@inproceedings{Du:02,
    author      = {W. Du and Z. Zhan},
    title       = {Building Decision Tree Classifier on Private Data},
    booktitle   = {Proceedings of the IEEE International Conference on Privacy, Security and Data Mining},
    month       = {December},
    address     = {Maebashi City, Japan},
    year        = 2002,
    pages       = "1--8",
    publisher   = "Australian Computer Society, Inc.",
}

• Wenliang Du and Zhijun Zhan, "Using Randomized Response Techniques for Privacy-Preserving Data Mining," SIGKDD 2003. [link]

[abstract]

Privacy is an important issue in data mining and knowledge discovery. In this paper, we propose to use the randomized response techniques to conduct the data mining computation. Specifically, we present a method to build decision tree classifiers from the disguised data. We conduct experiments to compare the accuracy of our decision tree with the one built from the original undisguised data. Our results show that although the data are disguised, our method can still achieve fairly high accuracy. We also show how the parameter used in the randomized response techniques affects the accuracy of the results.

[bib]

''this is the bibtex entry ...''

• Zhiqiang Yang, Sheng Zhong, Rebecca N. Wright, "Privacy-Preserving Classification of Customer Data without Loss of Accuracy," In Proc. of the 2005 SIAM International Conference on Data Mining (SDM) 2005. [link]

[abstract]

Privacy has become an increasingly important issue in data mining. In this paper, we consider a scenario in which a data miner surveys a large number of customers to learn classification rules on their data, while the sensitive attributes of these customers need to be protected. Solutions have been proposed to address this problem using randomization techniques. Such solutions exhibit a tradeoff of accuracy and privacy: the more each customer's private information is protected, the less accurate result the miner obtains; conversely, the more accurate the result, the less privacy for the customers. In this paper, we propose a simple cryptographic approach that is efficient even in a many-customer setting, provides strong privacy for each customer, and does not lose any accuracy as the cost of privacy. Our key technical contribution is a privacy-preserving method that allows a data miner to compute frequencies of values or tuples of values in the customers' data, without revealing the privacy-sensitive part of the data. Unlike general-purpose cryptographic protocols, this method requires no interaction between customers, and each customer only needs to send a single flow of communication to the data miner. However we are still able to ensure that nothing about the sensitive data beyond the desired frequencies is revealed to the data miner. To illustrate the power of our approach, we use our frequency mining computation to obtain a privacy-preserving naive Bayes classifier learning algorithm. Initial experimental results demonstrate the practical efficiency of our solution. We also suggest some other applications of privacy-preserving frequency mining.

[bib]

''this is the bibtex entry ...''
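
Once the frequencies are available (however the protocol delivered them), naive Bayes training is just ratios of counts. Below is a plaintext toy showing the computation the miner performs on the aggregated frequencies; the attributes are hypothetical and the cryptographic frequency-mining step itself is omitted.

from collections import Counter

data = [({"outlook": "sun", "windy": "n"}, "play"),
        ({"outlook": "sun", "windy": "y"}, "stay"),
        ({"outlook": "rain", "windy": "n"}, "play"),
        ({"outlook": "rain", "windy": "y"}, "stay")]

# These counts are exactly what the frequency-mining protocol yields.
class_n = Counter(c for _, c in data)
joint_n = Counter((a, r[a], c) for r, c in data for a in r)

def predict(record):
    def score(c):
        s = class_n[c] / len(data)            # class prior
        for a, v in record.items():
            s *= joint_n[(a, v, c)] / class_n[c]   # P(value | class)
        return s
    return max(class_n, key=score)

print(predict({"outlook": "sun", "windy": "n"}))   # 'play'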

• Hwanjo Yu, Xiaoqian Jiang, Jaideep Vaidya, "Privacy-Preserving SVM using Nonlinear Kernels on Horizontally Partitioned Data," Proc. of ACM SAC Data Mining Track, 2006. [link]

[abstract]

Traditional Data Mining and Knowledge Discovery algorithms assume free access to data, either at a centralized location or in federated form. Increasingly, privacy and security concerns restrict this access, thus derailing data mining projects. What is required is distributed knowledge discovery that is sensitive to this problem. The key is to obtain valid results, while providing guarantees on the non-disclosure of data. Support vector machine classification is one of the most widely used classification methodologies in data mining and machine learning. It is based on solid theoretical foundations and has wide practical application. This paper proposes a privacy-preserving solution for support vector machine (SVM) classification, PP-SVM for short. Our solution constructs the global SVM classification model from the data distributed at multiple parties, without disclosing the data of each party to others. We assume that data is horizontally partitioned – each party collects the same features of information for different data objects. We quantify the security and efficiency of the proposed method, and highlight future challenges.

[bib]

''this is the bibtex entry ...''

• K. Chen and L. Liu. Privacy preserving data classification with rotation perturbation. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), pages 589–592, Houston, TX, November 2005. [link]

[abstract]

This paper presents a random rotation perturbation approach for privacy preserving data classification. Concretely, we identify the importance of classification-specific information with respect to the loss of information factor, and present a random rotation perturbation framework for privacy preserving data classification. Our approach has two unique characteristics. First, we identify that many classification models utilize the geometric properties of datasets, which can be preserved by geometric rotation. We prove that the three types of classifiers will deliver the same performance over the rotation perturbed dataset as over the original dataset. Second, we propose a multi-column privacy model to address the problems of evaluating privacy quality for multidimensional perturbation. With this metric, we develop a local optimal algorithm to find the good rotation perturbation in terms of privacy guarantee. We also analyze both naive estimation and ICA-based reconstruction attacks with the privacy model. Our initial experiments show that the random rotation approach can provide high privacy guarantee while maintaining zero-loss of accuracy for the discussed classifiers.

[bib]

@INPROCEEDINGS{Chen_05,
  AUTHOR =       "K. Chen and L. Liu",
  TITLE =        "Privacy Preserving Data Classification with Rotation Perturbation",
  BOOKTITLE =    "Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05)",
  YEAR =         "2005",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "589--592",
  address =      "Houston, TX",
  month =        "November",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}
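
The geometric observation at the heart of the approach is easy to verify numerically: a random orthogonal matrix R preserves inner products and Euclidean distances, so distance- and kernel-based classifiers behave identically on X and XR. This is a sketch of the invariance only, not the paper's perturbation-selection algorithm, which also optimizes the rotation for privacy.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                   # original records
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))    # random orthogonal matrix
Xp = X @ R                                      # released, perturbed data

d_orig = np.linalg.norm(X[0] - X[1])
d_pert = np.linalg.norm(Xp[0] - Xp[1])
print(abs(d_orig - d_pert) < 1e-10)             # True: distances preserved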



Privacy Preserving Association Rule/Frequent Itemsets Mining

• Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, Johannes Gehrke, "Privacy Preserving Mining of Association Rules," SIGKDD 2002. [link]

[abstract]

We present a framework for mining association rules from transactions consisting of categorical items where the data has been randomized to preserve privacy of individual transactions. While it is feasible to recover association rules and preserve privacy using a straightforward "uniform" randomization, the discovered rules can unfortunately be exploited to find privacy breaches. We analyze the nature of privacy breaches and propose a class of randomization operators that are much more effective than uniform randomization in limiting the breaches. We derive formulae for an unbiased support estimator and its variance, which allow us to recover itemset supports from randomized datasets, and show how to incorporate these formulae into mining algorithms. Finally, we present experimental results that validate the algorithm by applying it on real datasets.

[bib]

''this is the bibtex entry ...''

• S. J. Rizvi and J. R. Haritsa. "Maintaining data privacy in association rule mining," In Proceedings of the 28th VLDB Conference, Hong Kong, China, August 2002. [link]

[abstract]

Data mining services require accurate input data for their results to be meaningful, but privacy concerns may influence users to provide spurious information. We investigate here, with respect to mining association rules, whether users can be encouraged to provide correct information by ensuring that the mining process cannot, with any reasonable degree of certainty, violate their privacy. We present a scheme, based on probabilistic distortion of user data, that can simultaneously provide a high degree of privacy to the user and retain a high level of accuracy in the mining results. The performance of the scheme is validated against representative real and synthetic datasets.

[bib]

@InProceedings{Rizvi:02,
  author =       {S. J. Rizvi and J. R. Haritsa},
  title =        {Maintaining Data Privacy in Association Rule Mining},
  booktitle =    {Proceedings of the 28th VLDB Conference},
  year =         {2002},
  address =      {Hong Kong, China},
  month =        {August},
}

• Murat Kantarcioglu, Chris Clifton, "Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1026-1037, Sept. 2004. [link]

[abstract]

Data mining can extract important knowledge from large data collections—but sometimes these collections are split among various parties. Privacy concerns may prevent the parties from directly sharing the data and some types of information about the data. This paper addresses secure mining of association rules over horizontally partitioned data. The methods incorporate cryptographic techniques to minimize the information shared, while adding little overhead to the mining task.

[bib]

@ARTICLE{Kantarcioglu:04,
  AUTHOR =       "M. Kantarcioglu and C. Clifton",
  TITLE =        "Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data",
  JOURNAL =      "IEEE Transactions on Knowledge and Data Engineering",
  YEAR =         "2004",
  volume =       "16",
  number =       "9",
  pages =        "1026--1037",
  month =        "September",
  note =         "",
  abstract =     "",
  keywords =     "",
}
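
One standard building block in this line of work is a secure sum of local counts, simulated below in a few lines. This is a toy single-process simulation: the first site masks its local support count with a random value, each site adds its own count as the total is passed around the ring, and the initiator removes the mask, so no site sees another's count. The paper combines such primitives with commutative encryption and addresses collusion, which this sketch ignores.

import random

def secure_sum(local_counts, modulus=1_000_003, seed=42):
    rng = random.Random(seed)
    mask = rng.randrange(modulus)          # known only to site 0
    running = (mask + local_counts[0]) % modulus
    for c in local_counts[1:]:             # passed around the ring
        running = (running + c) % modulus
    return (running - mask) % modulus      # site 0 removes the mask

print(secure_sum([120, 45, 78]))           # 243: global support count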

• J. S. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002. [link]

[abstract]

Privacy considerations often constrain data mining projects. This paper addresses the problem of association rule mining where transactions are distributed across sources. Each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. However, the sites must not reveal individual transaction data. We present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values.

[bib]

@inproceedings{Vaidya:02,
    author      = {J. S. Vaidya and C. Clifton},
    title       = {Privacy Preserving Association Rule Mining in Vertically Partitioned Data},
    booktitle   = {Proceedings of the Eighth ACM SIGKDD International Conference
                   on Knowledge Discovery and Data Mining},
    address     = {Edmonton, Canada},
    month       = {July},
    year        = {2002},
}

• Chunhua Su and Kouichi Sakurai, "Secure Computation Over Distributed Databases," 2005. [link]

[abstract]

Association rule mining over distributed databases is a very useful technique in the data mining domain. In doing association rule mining on multiple clients' datasets, there are some serious concerns about privacy and security. Such concerns may prevent the parties from directly sharing their data. Prior randomization approaches to this problem generally incur some tradeoff between privacy and accuracy. In this paper, we present a secure multi-party computation scheme which can do association rule mining over multiple clients' datasets while strictly respecting their privacy and maintaining the accuracy of the results.

[bib]

''this is the bibtex entry ...''

• M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi, "Blocking Anonymity Threats Raised by Frequent Itemset Mining," In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05), IEEE. November 27-30, 2005, New Orleans, Louisiana. [link]

[abstract]

In this paper we study when the disclosure of data mining results represents, per se, a threat to the anonymity of the individuals recorded in the analyzed database. The novelty of our approach is that we focus on an objective definition of privacy compliance of patterns without any reference to a preconceived knowledge of what is sensitive and what is not, on the basis of the rather intuitive and realistic constraint that the anonymity of individuals should be guaranteed. In particular, the problem addressed here arises from the possibility of inferring from the output of frequent itemset mining (i.e., a set of itemsets with support larger than a threshold σ), the existence of patterns with very low support (smaller than an anonymity threshold k). In the following we develop a simple methodology to block such inference opportunities by introducing distortion on the dangerous patterns.

[bib]

@INPROCEEDINGS{Atzori_05b,
  AUTHOR =       "M. Atzori, F. Bonchi, F. Giannotti and D. Pedreschi",
  TITLE =        "Blocking Anonymity Threats Raised by Frequent Itemset Mining",
  BOOKTITLE =    "Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05)",
  YEAR =         "2005",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "",
  month =        "November",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}



Privacy Preserving Clustering

• J. Vaidya and C. Clifton, "Privacy-Preserving K-Means Clustering over Vertically Partitioned Data," in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003. [link]

[abstract]

Privacy and security concerns can prevent sharing of data, derailing data mining projects. Distributed knowledge discovery, if done correctly, can alleviate this problem. The key is to obtain valid results, while providing guarantees on the (non)disclosure of data. We present a method for k-means clustering when different sites contain different attributes for a common set of entities. Each site learns the cluster of each entity, but learns nothing about the attributes at other sites.

[bib]

@INPROCEEDINGS{Vaidya:03,
    AUTHOR      = "J. Vaidya and C. Clifton",
    TITLE       = "Privacy-Preserving K-Means Clustering over Vertically Partitioned Data",
    BOOKTITLE   = {{The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}},
    YEAR        = {2003},
    EDITOR      = {},
    PAGES       = {206--215},
    VOLUME      = {},
    NUMBER      = {},
    SERIES      = {},
    ADDRESS     = {Washington, D.C.},
    MONTH       = {August},
    NOTE        = {},
    KEY         = {},
    KEYWORDS    = {},
    ISBN        = {},
    URL         = {},
    ABSTRACT    = {},
}

• S. Merugu and J. Ghosh, "Privacy-Preserving Distributed Clustering Using Generative Models," Proceedings of the Third IEEE International Conference on Data Mining ICDM 2003. [link]

[abstract]

We present a framework for clustering distributed data in unsupervised and semi-supervised scenarios, taking into account privacy requirements and communication costs. Rather than sharing parts of the original or perturbed data, we instead transmit the parameters of suitable generative models built at each local data site to a central location. We mathematically show that the best representative of all the data is a certain “mean” model, and empirically show that this model can be approximated quite well by generating artificial samples from the underlying distributions using Markov Chain Monte Carlo techniques, and then fitting a combined global model with a chosen parametric form to these samples. We also propose a new measure that quantifies privacy based on information theoretic concepts, and show that decreasing privacy leads to a higher quality of the combined model and vice versa. We provide empirical results on different data types to highlight the generality of our framework. The results show that high quality distributed clustering can be achieved with little privacy loss and low communication cost.

[bib]

@INPROCEEDINGS{Merugu_03,
  AUTHOR =       "S. Merugu and J. Ghosh",
  TITLE =        "Privacy-Preserving Distributed Clustering Using Generative Models",
  BOOKTITLE =    "Proceedings of the Third IEEE International Conference on Data Mining ICDM'03",
  YEAR =         "2003",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "Melbourne, FL",
  month =        "November",
  organization = "",
  note =         "",
  abstract =     "",
  keywords =     "",
}
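
The transmit-models-not-data idea reduces, in its simplest form, to the following numpy sketch with one Gaussian per site, rather than the paper's general generative models and MCMC machinery. Each site ships only (mean, covariance, count); the central site samples from the received models and fits a global model to the artificial sample.

import numpy as np

rng = np.random.default_rng(0)
sites = [rng.normal(loc, 1.0, size=(500, 2))       # private local data
         for loc in ([0, 0], [5, 5])]

# Each site transmits only its model parameters, never its records.
models = [(s.mean(axis=0), np.cov(s.T), len(s)) for s in sites]

# Central site: draw an artificial sample in proportion to each count,
# then fit whatever global model is desired to the combined sample.
fake = np.vstack([rng.multivariate_normal(m, C, size=n)
                  for m, C, n in models])
print(fake.mean(axis=0))    # global mean recovered, roughly [2.5, 2.5]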

• S. R. M. Oliveira and O. R. Zaïane, "Privacy Preservation When Sharing Data For Clustering," in Proceedings of the International Workshop on Secure Data Management in a Connected World, 2004. [link]

[abstract]

In this paper, we address the problem of protecting the underlying attribute values when sharing data for clustering. The challenge is how to meet privacy requirements and guarantee valid clustering results as well. To achieve this dual goal, we propose a novel spatial data transformation method called Rotation-Based Transformation (RBT). The major features of our data transformation are: a) it is independent of any clustering algorithm; b) it has a sound mathematical foundation; c) it is efficient and accurate; and d) it does not rely on intractability hypotheses from algebra and does not require CPU-intensive operations. We show analytically that although the data are transformed to achieve privacy, we can also get accurate clustering results by the safeguard of the global distances between data points.

[bib]

@INPROCEEDINGS{Oliveira_04p,
  AUTHOR =       "S. R. M. Oliveira and O. R. {Za\"{i}ane}",
  TITLE =        "Privacy Preservation When Sharing Data For Clustering",
  BOOKTITLE =    "Proceedings of the International Workshop on Secure Data Management in a Connected World",
  YEAR =         "2004",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "67--82",
  address =      "Toronto, Canada",
  month =        "August",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• M. Klusch, S. Lodi, and G. Moro, "Distributed Clustering Based on Sampling Local Density Estimates," IJCAI 2003. [link]

[abstract]

Huge amounts of data are stored in autonomous, geographically distributed sources. The discovery of previously unknown, implicit and valuable knowledge is a key aspect of the exploitation of such sources. In recent years several approaches to knowledge discovery and data mining, and in particular to clustering, have been developed, but only a few of them are designed for distributed data sources. We propose a novel distributed clustering algorithm based on non-parametric kernel density estimation, which takes into account the issues of privacy and communication costs that arise in a distributed environment.

[bib]

''this is the bibtex entry ...''

• Geetha Jagannathan and Rebecca N. Wright, "Privacy-preserving distributed k-means clustering over arbitrarily partitioned data," in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, 2005. [link]

[abstract]

Advances in computer networking and database technologies have enabled the collection and storage of vast quantities of data. Data mining can extract valuable knowledge from this data, and organizations have realized that they can often obtain better results by pooling their data together. However, the collected data may contain sensitive or private information about the organizations or their customers, and privacy concerns are exacerbated if data is shared between multiple organizations. Distributed data mining is concerned with the computation of models from data that is distributed among multiple participants. Privacy-preserving distributed data mining seeks to allow for the cooperative computation of such models without the cooperating parties revealing any of their individual data items. Our paper makes two contributions in privacy-preserving data mining. First, we introduce the concept of arbitrarily partitioned data, which is a generalization of both horizontally and vertically partitioned data. Second, we provide an efficient privacy-preserving protocol for k-means clustering in the setting of arbitrarily partitioned data.

[bib]

@inproceedings{1081942,
 author = {Geetha Jagannathan and Rebecca N. Wright},
 title = {Privacy-preserving distributed k-means clustering over arbitrarily partitioned data},
 booktitle = {KDD '05: Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining},
 year = {2005},
 isbn = {1-59593-135-X},
 pages = {593--599},
 location = {Chicago, Illinois, USA},
 doi = {http://doi.acm.org/10.1145/1081870.1081942},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }

• Geetha Jagannathan, Krishnan Pillaipakkamnatt and Rebecca N. Wright, "A New Privacy-Preserving Distributed k-Clustering Algorithm," in Proceedings of the 2006 SIAM International Conference on Data Mining (SDM), 2006. [link]

[abstract]

We present a simple I/O-efficient k-clustering algorithm that was designed with the goal of enabling a privacy-preserving version of the algorithm. Our experiments show that this algorithm produces cluster centers that are, on average, more accurate than the ones produced by the well known iterative k-means algorithm. We use our new algorithm as the basis for a communication-efficient privacy-preserving k-clustering protocol for databases that are horizontally partitioned between two parties. Unlike existing privacy-preserving protocols based on the k-means algorithm, this protocol does not reveal intermediate candidate cluster centers.

[bib]

''this is the bibtex entry ...''



Privacy Preserving Bayes Classifier/Bayesian Network

• M. Kantarcoglu and J. Vaidya, "Privacy Preserving Naive Bayes Classifier for Horizontally Partitioned Data," In IEEE ICDM Workshop on Privacy Preserving Data Mining, pp. 3-9, 2003. [link]

[abstract]

The problem of secure distributed classification is an important one. In many situations, data is split between multiple organizations. These organizations may want to utilize all of the data to create more accurate predictive models while revealing neither their training data / databases nor the instances to be classified. The Naive Bayes Classifier is a simple but efficient baseline classifier. In this paper, we present a privacy preserving Naive Bayes Classifier for horizontally partitioned data.

[bib]

@INPROCEEDINGS{Kantarcoglu_03icdm,
  AUTHOR =       "M. Kantarcoglu and J. Vaidya",
  TITLE =        "Privacy Preserving Naive Bayes Classifier for Horizontally Partitioned Data",
  BOOKTITLE =    "IEEE ICDM Workshop on Privacy Preserving Data Mining",
  YEAR =         "2003",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "3--9",
  address =      "Melbourne, FL",
  month =        "November",
  organization = "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• D. Meng, K. Sivakumar, and H. Kargupta, "Privacy Sensitive Bayesian Network Parameter Learning," in Proceedings of the Fourth IEEE International Conference on Data Mining. Brighton, UK, pages 427-430, 2004. [link]

[abstract]

This paper considers the problem of learning the parameters of a Bayesian Network, assuming the structure of the network is given, from a privacy-sensitive dataset that is distributed between multiple parties. For a binary-valued dataset, we show that the count information required to estimate the conditional probabilities in a Bayesian network can be obtained as a solution to a set of linear equations involving some inner product between the relevant different feature vectors. We consider a random projection-based method that was proposed elsewhere to securely compute the inner product (with a modified implementation of that method).

[bib]

''this is the bibtex entry ...''
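
The random-projection trick for approximating an inner product without exchanging raw vectors can be checked numerically. This sketches the primitive only; the secure protocol around it is the paper's subject. With both parties applying the same k x d random matrix (entries scaled by 1/sqrt(k)), the inner product of the projections equals the true inner product in expectation.

import numpy as np

rng = np.random.default_rng(0)
d, k = 1000, 200
x, y = rng.normal(size=d), rng.normal(size=d)   # the two private vectors
R = rng.normal(0, 1, size=(k, d)) / np.sqrt(k)  # shared random projection

print(x @ y, (R @ x) @ (R @ y))   # the two values are close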

• Rebecca Wright and Zhiqiang Yang, "Privacy-preserving Bayesian network structure computation on distributed heterogeneous data," in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining , 2004. [link]

[abstract]

As more and more activities are carried out using computers and computer networks, the amount of potentially sensitive data stored by business, governments, and other parties increases. Different parties may wish to benefit from cooperative use of their data, but privacy regulations and other privacy concerns may prevent the parties from sharing their data. Privacy-preserving data mining provides a solution by creating distributed data mining algorithms in which the underlying data is not revealed. In this paper, we present a privacy-preserving protocol for a particular data mining task: learning the Bayesian network structure for distributed heterogeneous data. In this setting, two parties owning confidential databases wish to learn the structure of Bayesian network on the combination of their databases without revealing anything about their data to each other. We give an efficient and privacy-preserving version of the K2 algorithm to construct the structure of a Bayesian network for the parties' joint data.

[bib]

@inproceedings{1014145,
 author = {Rebecca Wright and Zhiqiang Yang},
 title = {Privacy-preserving Bayesian network structure computation on distributed heterogeneous data},
 booktitle = {KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining},
 year = {2004},
 isbn = {1-58113-888-1},
 pages = {713--718},
 location = {Seattle, WA, USA},
 doi = {http://doi.acm.org/10.1145/1014052.1014145},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }

• Zhiqiang Yang, Rebecca N. Wright, "Improved Privacy-Preserving Bayesian Network Parameter Learning on Vertically Partitioned Data," in Pre-Proceedings of International Workshop on Privacy Data Management, 2005. Held in conjunction with ICDE 2005. [link]

[abstract]

Privacy concerns often prevent different parties from sharing their data in order to carry out data mining applications on their joint data. Privacy-preserving data mining seeks to address this by enabling parties to jointly compute a data mining algorithm on distributed data without sharing their data. In this paper, we address a particular data mining problem, that of learning the parameters of a Bayesian network on a vertically partitioned database. We provide a simple privacy-preserving protocol for learning the parameters of Bayesian network on vertically partitioned databases. In comparison to the previously known solution for this problem (Meng, Sivakumar, and Kargupta, 2004), our solution provides better performance, full privacy, and complete accuracy. In combination with our previous work on privacy-preserving learning of Bayesian network structure on vertically partitioned databases, this work provides a complete privacy-preserving protocol for learning Bayesian networks (both structure and parameters) on vertically partitioned data, with very little overhead beyond computing the structure alone.

[bib]

''this is the bibtex entry ...''



Privacy Preserving Multivariate Statistical Analysis

• W. Du, Y.S. Han, and S. Chen, “Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification,” Proc. 2004 SIAM Int’l Conf. Data Mining (SDM04), Apr. 2004. [link]

[abstract]

Multivariate statistical analysis is an important data analysis technique that has found applications in various areas. In this paper, we study some multivariate statistical analysis methods in the Secure 2-party Computation (S2C) framework, illustrated by the following scenario: two parties, each having a secret data set, want to conduct statistical analysis on their joint data, but neither party is willing to disclose its private data to the other party or any third party. The current statistical analysis techniques cannot be used directly to support this kind of computation because they require all parties to send the necessary data to a central place. In this paper, we define two Secure 2-party multivariate statistical analysis problems: the Secure 2-party Multivariate Linear Regression problem and the Secure 2-party Multivariate Classification problem. We have developed a practical security model, based on which we have developed a number of building blocks for solving these two problems.

[bib]

@INPROCEEDINGS{Du_04,
  AUTHOR =       "W. Du and Y. S. Han and S. Chen",
  TITLE =        "Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification",
  BOOKTITLE =    "Proceedings of 2004 SIAM International Conference on Data Mining (SDM04)",
  YEAR =         "2004",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "Lake Buena Vista, FL",
  month =        "April",
  organization = "",
  note =         "",
  abstract =     "",
  keywords =     "",
}
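
One standard way to see why secure inner products are the right building block here (not necessarily the paper's exact construction): with a vertically partitioned design matrix X = [X_A | X_B], the normal equations X^T X b = X^T y decompose into Gram blocks that are either purely local or cross the partition, and only the cross blocks need a secure protocol. The sketch below computes everything in the clear just to show the algebra; the data and names are illustrative.

import numpy as np

rng = np.random.default_rng(1)
n = 200
X_A = rng.normal(size=(n, 2))   # Alice's attribute columns
X_B = rng.normal(size=(n, 3))   # Bob's attribute columns
y = rng.normal(size=n)          # response, say held by Alice

# Local blocks need no interaction; only X_A^T X_B (and X_B^T y)
# cross the partition and would use a secure inner-product protocol.
G_AA = X_A.T @ X_A
G_BB = X_B.T @ X_B
G_AB = X_A.T @ X_B              # the only cross-partition block
XtX = np.block([[G_AA, G_AB], [G_AB.T, G_BB]])
Xty = np.concatenate([X_A.T @ y, X_B.T @ y])

beta = np.linalg.solve(XtX, Xty)
# Identical to fitting on the (never materialized) joint matrix:
assert np.allclose(beta, np.linalg.lstsq(np.hstack([X_A, X_B]), y, rcond=None)[0])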

• W. Du, M. Atallah, "Privacy-Preserving Cooperative Statistical Analysis," 17th Annual Computer Security Applications Conference (ACSAC'01), p. 102, 2001. [link]

[abstract]

The growth of the Internet opens up tremendous opportunities for cooperative computation, where the answer depends on the private inputs of separate entities. Sometimes these computations may occur between mutually untrusting entities. The problem is trivial if the context allows the conduct of these computations by a trusted entity that would know the inputs from all the participants; however, if the context disallows this then the techniques of secure multiparty computation become very relevant and can provide useful solutions. Statistical analysis is a widely used computation in real life, but the known methods usually require one to know the whole data set; little work has been conducted to investigate how statistical analysis could be performed in a cooperative environment, where the participants want to conduct statistical analysis on the joint data set, but each participant is concerned about the confidentiality of its own data. In this paper we have developed protocols for conducting statistical analysis in this kind of cooperative environment based on a data perturbation technique and cryptographic primitives.

[bib]

''this is the bibtex entry ...''
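
A standard building block in this line of cooperative statistics (illustrative only, not necessarily the protocol in the paper): additive secret sharing lets parties publish masked values whose masks cancel, so the joint sum, and hence means and related statistics, is computable without revealing any individual input.

import secrets

M = 2**64  # work modulo M so random masks wrap around

def share(value, n_parties):
    """Split an integer into n_parties random shares summing to it mod M."""
    shares = [secrets.randbelow(M) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % M)
    return shares

private_counts = [17, 42, 5]            # one secret value per party
all_shares = [share(c, 3) for c in private_counts]

# Party j publishes only the sum of the j-th shares it received.
published = [sum(all_shares[i][j] for i in range(3)) % M for j in range(3)]
total = sum(published) % M
assert total == sum(private_counts)     # joint statistic, inputs hidden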

• Hiranmayee Subramaniam, Rebecca N. Wright, Zhiqiang Yang, "Experimental Analysis of Privacy-Preserving Statistics Computation," in Proceedings of the Workshop on Secure Data Management, held in conjunction with VLDB '04, August 30, 2004. [link]

[abstract]

The recent investigation of privacy-preserving data mining and other kinds of privacy-preserving distributed computation has been motivated by the growing concern about the privacy of individuals when their data is stored, aggregated, and mined for information. Building on the study of selective private function evaluation and the efforts towards practical algorithms for privacy-preserving data mining solutions, we analyze and implement solutions to an important primitive, that of computing statistics of selected data in a remote database in a privacy-preserving manner. We examine solutions in different scenarios ranging from a high-speed communications medium, such as a LAN or high-speed Internet connection, to a decelerated communications medium to account for worst-case communication delays such as might be provided in a wireless multihop setting. Our experimental results show that in the absence of special-purpose hardware accelerators or practical optimizations, the computational complexity is the performance bottleneck of these solutions rather than the communication complexity. We also evaluate several practical optimizations to amortize the computation time and to improve the practical efficiency.

[bib]

''this is the bibtex entry ...''



Private Information Retrieval and Database Applications

• W. Du and M. J. Atallah, "Protocols for Secure Remote Database Access with Approximate Matching," 7th ACM Conference on Computer and Communications Security (ACMCCS 2000), 2000. [link]

[abstract]

Suppose that Bob has a database D and that Alice wants to perform a search query q on D (e.g., “is q in D?”). Since Alice is concerned about her privacy, she does not want Bob to know the query q or the response to the query. How could this be done? There are elegant cryptographic techniques for solving this problem under various constraints (such as “Bob should know neither q nor the answer to the query” and “Alice should learn nothing about D other than the answer to the query”), while optimizing various performance criteria (e.g., amount of communication). We consider the version of this problem where the query is of the type “is q approximately in D?” for a number of different notions of “approximate”, some of which arise in image processing and template matching, while others arise in biological sequence comparisons. New techniques are needed in this framework of approximate searching, because each notion of “approximate equality” introduces its own set of difficulties; using encryption is more problematic in this framework because the items that are approximately equal cease to be so after encryption or cryptographic hashing. Practical protocols for solving such problems make possible new forms of e-commerce between proprietary database owners and customers who seek to query the database, with privacy. We first present four secure remote database access models that could be used in an e-commerce framework, each of which has different privacy requirements. We then present our solutions for achieving privacy in each of these four models.

[bib]

@INPROCEEDINGS{Du_00,
  AUTHOR =       "W. Du and M. J. Atallah",
  TITLE =        "Protocols for Secure Remote Database Access with Approximate Matching",
  BOOKTITLE =    "7th ACM Conference on Computer and Communications Security (ACMCCS 2000).
                  The first workshop on Security and Privacy in E-Commerce",
  YEAR =         "2000",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "Athens, Greece",
  month =        "November",
  organization = "",
  note =         "",
  abstract =     "",
  keywords =     "",
}

• Yan-Cheng Chang and Michael Mitzenmacher, "Privacy Preserving Keyword Searches on Remote Encrypted Data," Cryptology ePrint Archive: Report 2004/051. [link]

[abstract]

We consider the following problem: a user U wants to store his files in an encrypted form on a remote file server FS. Later the user U wants to efficiently retrieve some of the encrypted files containing (or indexed by) specific keywords, keeping the keywords themselves secret and not jeopardizing the security of the remotely stored files. For example, a user may want to store old e-mail messages encrypted on a server managed by Yahoo or another large vendor, and later retrieve certain messages while traveling with a mobile device.

In this paper, we offer solutions for this problem under well-defined security requirements. Our schemes are efficient in the sense that no public-key cryptosystem is involved. Indeed, our approach is independent of the encryption method chosen for the remote files. They are also incremental, in that U can submit new files which are totally secure against previous queries but still searchable against future queries.

[bib]

''this is the bibtex entry ...''
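
Not the scheme of the paper, but a minimal sketch in the same symmetric-key spirit: the server stores opaque keyed-hash tags for (keyword, file) pairs, and a search reveals only a per-keyword trapdoor, with no public-key operations. All names and the key below are illustrative.

import hmac, hashlib

KEY = b"user-secret-key"   # known only to the user U

def trapdoor(keyword: str) -> bytes:
    return hmac.new(KEY, keyword.encode(), hashlib.sha256).digest()

def tag(keyword: str, file_id: str) -> bytes:
    # The server stores this opaque value; without KEY it learns nothing.
    return hmac.new(trapdoor(keyword), file_id.encode(), hashlib.sha256).digest()

# Client side: build the index while uploading the (encrypted) files.
files = {"mail1": ["budget", "travel"], "mail2": ["travel"]}
server_index = {tag(w, fid) for fid, words in files.items() for w in words}

# Search: the client sends trapdoor("travel"); the server tests file ids.
t = trapdoor("travel")
hits = [fid for fid in files
        if hmac.new(t, fid.encode(), hashlib.sha256).digest() in server_index]
print(hits)   # ['mail1', 'mail2']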

• Xintao Wu, Yongge Wang and Yuliang Zheng, "Privacy preserving database application testing," Proceedings of the 2003 ACM workshop on Privacy in the electronic society, pp. 118-128, 2003. [link]

[abstract]

Traditionally, application software developers carry out their tests on their own local development databases. However, such local databases usually have only a small number of sample data and hence cannot simulate satisfactorily a live environment, especially in terms of performance and scalability testing. On the other hand, the idea of testing applications over live production databases is increasingly problematic in most situations, primarily due to the fact that such use of live production databases has the potential to expose sensitive data to an unauthorized tester and to incorrectly update information in the underlying database. In this paper, we investigate techniques to generate mock databases for application software testing without revealing any confidential information from the live production databases. Specifically, we will design mechanisms to create the deterministic rule set R, the non-deterministic rule set NR, and the statistical data set S for a live production database. We will then build a security Analyzer which will process the triplet <R, NR, S> together with security requirements (security policy) and output a new triplet <R', NR', S'>. The security Analyzer will guarantee that no confidential information could be inferred from the new triplet <R', NR', S'>. The mock database generated from this new triplet can simulate the live environment for testing purposes, while maintaining the privacy of data in the original database.

[bib]

@inproceedings{1005159,
 author = {Xintao Wu and Yongge Wang and Yuliang Zheng},
 title = {Privacy preserving database application testing},
 booktitle = {WPES '03: Proceedings of the 2003 ACM workshop on Privacy in the electronic society},
 year = {2003},
 isbn = {1-58113-776-1},
 pages = {118--128},
 location = {Washington, DC},
 doi = {http://doi.acm.org/10.1145/1005140.1005159},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }

• Rakesh Agrawal, Alexandre Evfimievski and Ramakrishnan Srikant, "Information sharing across private databases," SIGMOD 2003. [link]

[abstract]

Literature on information integration across databases tacitly assumes that the data in each database can be revealed to the other databases. However, there is an increasing need for sharing information across autonomous entities in such a way that no information apart from the answer to the query is revealed. We formalize the notion of minimal information sharing across private databases, and develop protocols for intersection, equijoin, intersection size, and equijoin size. We also show how new applications can be built using the proposed protocols.

[bib]

@inproceedings{872771,
 author = {Rakesh Agrawal and Alexandre Evfimievski and Ramakrishnan Srikant},
 title = {Information sharing across private databases},
 booktitle = {SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data},
 year = {2003},
 isbn = {1-58113-634-X},
 pages = {86--97},
 location = {San Diego, California},
 doi = {http://doi.acm.org/10.1145/872757.872771},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }
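
Intersection protocols of this kind are often explained with commutative encryption: if E_k(x) = x^k mod p, then E_a(E_b(x)) = E_b(E_a(x)), so two parties can each layer their own key onto hashed items and compare the double-encrypted values. The sketch below is a toy version under that assumption, not the paper's exact protocol, and it skips the key-hygiene details a real implementation needs.

import hashlib, secrets

P = 2**127 - 1   # a Mersenne prime; adequate for a demonstration group

def h(item: str) -> int:
    return int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % P

def E(x: int, k: int) -> int:
    return pow(x, k, P)   # commutative: E_a(E_b(x)) == E_b(E_a(x))

a_key = secrets.randbelow(P - 2) + 1
b_key = secrets.randbelow(P - 2) + 1
alice = {"alice@x.org", "carol@y.org"}
bob = {"carol@y.org", "dave@z.org"}

# Each party encrypts its own hashed items; the other layers its key on top.
alice_double = {E(E(h(x), a_key), b_key) for x in alice}
bob_double = {E(E(h(x), b_key), a_key) for x in bob}
print(len(alice_double & bob_double))   # 1 -> one common element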

• Irit Dinur and Kobbi Nissim, "Revealing information while preserving privacy," Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 202-210, 2003. [link]

[abstract]

We examine the tradeoff between privacy and usability of statistical databases. We model a statistical database by an n-bit string d1,...,dn, with a query being a subset q ⊆ [n] to be answered by Σ_{i∈q} d_i. Our main result is a polynomial reconstruction algorithm of data from noisy (perturbed) subset sums. Applying this reconstruction algorithm to statistical databases, we show that in order to achieve privacy one has to add perturbation of magnitude Ω(√n). That is, smaller perturbation always results in a strong violation of privacy. We show that this result is tight by exemplifying access algorithms for statistical databases that preserve privacy while adding perturbation of magnitude Õ(√n). For time-T bounded adversaries we demonstrate a privacy-preserving access algorithm whose perturbation magnitude is ≈ √T.

[bib]

@inproceedings{773173,
 author = {Irit Dinur and Kobbi Nissim},
 title = {Revealing information while preserving privacy},
 booktitle = {PODS '03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems},
 year = {2003},
 isbn = {1-58113-670-6},
 pages = {202--210},
 location = {San Diego, California},
 doi = {http://doi.acm.org/10.1145/773153.773173},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }
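
A small simulation of the reconstruction result in this entry: issue random subset-sum queries against a perturbed n-bit database and reconstruct the bits. The paper's attack uses linear programming; the sketch substitutes least squares plus rounding, which already recovers almost everything when the noise is well below √n. Parameters are illustrative.

import numpy as np

rng = np.random.default_rng(7)
n = 128
d = rng.integers(0, 2, size=n)          # the secret database bits

def answer(q_mask, noise=2):
    # Noisy subset-sum oracle: sum_{i in q} d_i plus bounded perturbation.
    return q_mask @ d + rng.integers(-noise, noise + 1)

m = 4 * n                                # number of random subset queries
Q = rng.integers(0, 2, size=(m, n))
y = np.array([answer(Q[i]) for i in range(m)])

# Least squares + rounding stands in for the paper's LP reconstruction.
d_hat = np.clip(np.round(np.linalg.lstsq(Q, y, rcond=None)[0]), 0, 1)
print("bits recovered incorrectly:", int(np.sum(d_hat != d)))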

• Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith and Hoeteck Wee, "Toward Privacy in Public Databases," Proceedings of the 2nd Theory of Cryptography Conference (TCC'05), 2005. [link]

[abstract]

We initiate a theoretical study of the census problem. Informally, in a census individual respondents give private information to a trusted party (the census bureau), who publishes a sanitized version of the data. There are two fundamentally conflicting requirements: privacy for the respondents and utility of the sanitized data. Unlike in the study of secure function evaluation, in which privacy is preserved to the extent possible given a specific functionality goal, in the census problem privacy is paramount; intuitively, things that cannot be learned “safely” should not be learned at all.

An important contribution of this work is a definition of privacy (and privacy compromise) for statistical databases, together with a method for describing and comparing the privacy offered by specific sanitization techniques. We obtain several privacy results using two different sanitization techniques, and then show how to combine them via cross training. We also obtain two utility results involving clustering.

[bib]

@INPROCEEDINGS{SChawla_05,
  AUTHOR =       "S. Chawla and C. Dwork and F. McSherry and A. Smith and H. Wee",
  TITLE =        "Toward Privacy in Public Databases",
  BOOKTITLE =    "Proceedings of the 2nd Theory of Cryptography Conference (TCC'05)",
  YEAR =         "2005",
  editor =       "",
  volume =       "",
  number =       "",
  series =       "",
  pages =        "",
  address =      "Cambridge MA",
  month =        "February",
  organization = "",
  publisher =    "",
  note =         "",
  abstract =     "",
  keywords =     "",
  file = F
}

• Cynthia Dwork and Kobbi Nissim, "Privacy-Preserving Datamining on Vertically Partitioned Databases," Crypto 2004. [link]

[abstract]

In a recent paper Dinur and Nissim considered a statistical database in which a trusted database administrator monitors queries and introduces noise to the response with the goal of maintaining data privacy. Under a rigorous definition of breach of privacy, Dinur and Nissim proved that unless the total number of queries is sub-linear in the size of the database, a substantial amount of noise is required to avoid a breach, rendering the database almost useless.

As databases grow increasingly large, the possibility of being able to query only a sublinear number of times becomes realistic. We further investigate this situation, generalizing the previous work in two important directions: multi-attribute databases (previous work dealt only with single-attribute databases) and vertically partitioned databases, in which different subsets of attributes are stored in different databases. In addition, we show how to use our techniques for data mining on published noisy statistics.

[bib]

''this is the bibtex entry ...''

• Shuchi Chawla and Cynthia Dwork and Frank McSherry and Kunal Talwar, "On the Utility of Privacy-Preserving Histograms," UAI, 2005. [link]

[abstract]

In a census, individual respondents give private information to a trusted party (the census bureau), who publishes a sanitized version of the data. There are two fundamentally conflicting requirements: privacy for the respondents and utility of the sanitized data. Note that this framework is inherently noninteractive. Recently, Chawla et al. (TCC’2005) initiated a theoretical study of the census problem and presented an intuitively appealing definition of privacy breach, called isolation, together with a formal specification of what is required from a data sanitization algorithm: access to the sanitized data should not increase an adversary’s ability to isolate any individual. They also showed that if the data are drawn uniformly from a high-dimensional hypercube then recursive histogram sanitization can preserve privacy with a high probability. We extend these results in several ways. First, we develop a method for computing a privacy-preserving histogram sanitization of “round” distributions, such as the uniform distribution over a high-dimensional ball or sphere. This problem is quite challenging because, unlike for the hypercube, the natural histogram over such a distribution may have long and thin cells that hurt the proof of privacy. We then develop techniques for randomizing the histogram constructions both for the hypercube and the hypersphere. These permit us to apply known results for approximating various quantities of interest (e.g., cost of the minimum spanning tree, or the cost of an optimal solution to the facility location problem over the data points) from histogram counts, in a privacy-preserving fashion.

[bib]

''this is the bibtex entry ...''

• Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, Yirong Xu, "Implementing P3P Using Database Technology," icde, p. 595, 19th International Conference on Data Engineering (ICDE'03), 2003. [link]

[abstract]

Platform for Privacy Preferences (P3P) is the most significant effort currently underway to enable web users to gain control over their private information. P3P provides mechanisms for web site owners to express their privacy policies in a standard format that a user can programmatically check against her privacy preferences to decide whether to release her data to the web site. We discuss architectural alternatives for implementing P3P and present a server-centric implementation that reuses database querying technology, as opposed to the prevailing client-centric implementations based on specialized engines. Not only does the proposed implementation have qualitative advantages, our experiments indicate that it performs significantly better than the sole public-domain client-centric implementation and that the latency introduced by preference matching is small enough for real-world deployments of P3P.

[bib]

''this is the bibtex entry ...''

• Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, and Yirong Xu, "Hippocratic Databases," Proceedings of the 28th VLDB Conference, 2002. [link]

[abstract]

The Hippocratic Oath has guided the conduct of physicians for centuries. Inspired by its tenet of preserving privacy, we argue that future database systems must include responsibility for the privacy of data they manage as a founding tenet. We enunciate the key privacy principles for such Hippocratic database systems. We propose a strawman design for Hippocratic databases, identify the technical challenges and problems in designing such databases, and suggest some approaches that may lead to solutions. Our hope is that this paper will serve to catalyze a fruitful and exciting direction for future database research.

[bib]

''this is the bibtex entry ...''

• M. Freedman, K. Nissim and B. Pinkas, "Efficient Private Matching and Set Intersection," Advances in Cryptology - Eurocrypt 2004 Proceedings, LNCS 3027, Springer-Verlag, pp. 1-19, May 2004. [link]

[abstract]

We consider the problem of computing the intersection of private datasets of two parties, where the datasets contain lists of elements taken from a large domain. This problem has many applications for online collaboration. We present protocols, based on the use of homomorphic encryption and balanced hashing, for both semi-honest and malicious environments. For lists of length k, we obtain O(k) communication overhead and O(k ln ln k) computation. The protocol for the semi-honest environment is secure in the standard model, while the protocol for the malicious environment is secure in the random oracle model. We also consider the problem of approximating the size of the intersection, show a linear lower-bound for the communication overhead of solving this problem, and provide a suitable secure protocol. Lastly, we investigate other variants of the matching problem, including extending the protocol to the multi-party setting as well as considering the problem of approximate matching.

[bib]

''this is the bibtex entry ...''
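
A simplified semi-honest sketch of the homomorphic polynomial technique described above, using the third-party python-paillier library (phe); treat the library choice, key size, and set values as assumptions. The client encrypts the coefficients of a polynomial whose roots are its elements; for each of its own elements y the server returns Enc(r*P(y) + y), which decrypts to y exactly when y is in the intersection.

import secrets
from phe import paillier   # pip install phe (python-paillier)

def poly_coeffs(roots):
    """Coefficients of prod (x - r), lowest degree first."""
    coeffs = [1]
    for r in roots:
        new = [0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            new[k + 1] += c        # c * x
            new[k] += -r * c       # c * (-r)
        coeffs = new
    return coeffs

client_set = {3, 7, 11}
server_set = {7, 11, 20}

pub, priv = paillier.generate_paillier_keypair(n_length=1024)
enc_coeffs = [pub.encrypt(c) for c in poly_coeffs(sorted(client_set))]

replies = []
for y in server_set:
    # Homomorphic evaluation of P(y), then blinding: Enc(r*P(y) + y).
    enc_p = sum((c * (y ** k) for k, c in enumerate(enc_coeffs)), pub.encrypt(0))
    r = secrets.randbelow(2**64) + 1
    replies.append(enc_p * r + y)

# Client: a reply decrypts to one of its own elements iff it matched.
print({v for v in (priv.decrypt(z) for z in replies) if v in client_set})  # {7, 11}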

• Omer Barkol and Yuval Ishai, "Secure Computation of Constant Depth Circuits with Applications to Database Search Problems," 25th Annual International Cryptology Conference, 2005. [link]

[abstract]

This is the abstract ...

[bib]

''this is the bibtex entry ...''

• Aggelos Kiayias and Antonina Mitrofanova, "Testing Disjointness of Private Datasets," Financial Cryptography and Data Security: 9th International Conference, FC 2005. [link]

[abstract]

Two parties, say Alice and Bob, possess two sets of elements that belong to a universe of possible values and wish to test whether these sets are disjoint or not. In this paper we consider the above problem in the setting where Alice and Bob wish to disclose no information to each other about their sets beyond the single bit: “whether the intersection is empty or not.” This problem has many applications in commercial settings where two mutually distrustful parties wish to decide with minimum possible disclosure whether there is any overlap between their private datasets. We present three protocols that solve the above problem that meet different efficiency and security objectives and data representation scenarios. Our protocols are based on Homomorphic encryption and in our security analysis, we consider the semi-honest setting as well as the malicious setting. Our most efficient construction for a large universe in terms of overall communication complexity uses a new encryption primitive that we introduce called “superposed encryption.” We formalize this notion and provide a construction that may be of independent interest. For dealing with the malicious adversarial setting we take advantage of recent efficient constructions of Universally-Composable commitments based on verifiable encryption as well as zero-knowledge proofs of language membership.

[bib]

''this is the bibtex entry ...''


Privacy Preserving Collaborative Filtering

• John Canny, "Collaborative filtering with privacy via factor analysis," Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002. [link]

[abstract]

Collaborative filtering (CF) is valuable in e-commerce, and for direct recommendations for music, movies, news etc. But today's systems have several disadvantages, including privacy risks. As we move toward ubiquitous computing, there is a great potential for individuals to share all kinds of information about places and things to do, see and buy, but the privacy risks are severe. In this paper we describe a new method for collaborative filtering which protects the privacy of individual data. The method is based on a probabilistic factor analysis model. Privacy protection is provided by a peer-to-peer protocol which is described elsewhere, but outlined in this paper. The factor analysis approach handles missing data without requiring default values for them. We give several experiments that suggest that this is the most accurate method for CF to date. The new algorithm has other advantages in speed and storage over previous algorithms. Finally, we suggest applications of the approach to other kinds of statistical analyses of survey or questionnaire data.

[bib]

@inproceedings{564419,
 author = {John Canny},
 title = {Collaborative filtering with privacy via factor analysis},
 booktitle = {SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval},
 year = {2002},
 isbn = {1-58113-561-0},
 pages = {238--245},
 location = {Tampere, Finland},
 doi = {http://doi.acm.org/10.1145/564376.564419},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }


• Huseyin Polat, Wenliang Du, "Privacy-Preserving Collaborative Filtering Using Randomized Perturbation Techniques," icdm, p. 625, Third IEEE International Conference on Data Mining (ICDM'03), 2003. [link]

[abstract]

Collaborative Filtering (CF) techniques are becoming increasingly popular with the evolution of the Internet. To conduct collaborative filtering, data from customers are needed. However, collecting high quality data from customers is not an easy task because many customers are so concerned about their privacy that they might decide to give false information. We propose a randomized perturbation (RP) technique to protect users' privacy while still producing accurate recommendations.

[bib]

''this is the bibtex entry ...''
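
A tiny illustration of why randomized perturbation can still support collaborative filtering: zero-mean noise added independently to each rating roughly cancels in aggregate quantities such as item-item scalar products, which is what neighborhood-based recommenders need. The data, noise level, and similarity measure here are illustrative, not the paper's exact setup.

import numpy as np

rng = np.random.default_rng(3)
n_users = 5000
item_a = rng.uniform(1, 5, n_users)    # true ratings, never disclosed
item_b = rng.uniform(1, 5, n_users)

sigma = 1.0                            # perturbation users apply locally
a_pert = item_a + rng.normal(0, sigma, n_users)   # what the server sees
b_pert = item_b + rng.normal(0, sigma, n_users)

# Independent zero-mean noise cancels in expectation, so the server can
# still estimate the aggregate scalar product from perturbed data.
true_sp = item_a @ item_b / n_users
est_sp = a_pert @ b_pert / n_users
print(f"true {true_sp:.3f}  estimated from perturbed data {est_sp:.3f}")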



Privacy Preserving Data Stream Mining

• Constructions and Practical Applications for Private Stream Searching (Extended Abstract). John Bethencourt, Dawn Song, and Brent Waters. To appear at the 27th IEEE Symposium on Security and Privacy (Oakland), May 2006. [link]

[abstract]

A system for private stream searching allows a client to retrieve documents matching some search criteria from a remote server while the server evaluating the re- quest remains provably oblivious to the search criteria. In this extended abstract, we give a high level outline of a new scheme for this problem and an experimental analysis of its scalability. The new scheme is highly e cient in practice. We demonstrate the practical ap- plicability of the scheme by considering its performance in the demanding scenario of providing a privacy pre- serving version of the Google News Alerts service.

[bib]

''this is the bibtex entry ...''

• Rafail Ostrovsky, William Skeith, "Private Searching on Streaming Data," Preliminary version in Proceedings of Advances in Cryptology, CRYPTO-2005. [link]

[abstract]

In this paper, we consider the problem of private searching on streaming data, where we can efficiently implement searching for documents under a secret criteria (such as presence or absence of a hidden combination of hidden keywords) under various cryptographic assumptions. Our results can be viewed in a variety of ways: as a generalization of the notion of a Private Information Retrieval (to the more general queries and to a streaming environment as well as to public-key program obfuscation); as positive results on privacy-preserving datamining; and as a delegation of hidden program computation to other machines.

[bib]

''this is the bibtex entry ...''


Privacy in P2P or Large-scale Distributed Environments

• Bobi Gilburd and Assaf Schuster and Ran Wolff, "k-TTP: a new privacy model for large-scale distributed environments," Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 563-568, 2004. [link]

[abstract]

Secure multiparty computation allows parties to jointly compute a function of their private inputs without revealing anything but the output. Theoretical results [2] provide a general construction of such protocols for any function. Protocols obtained in this way are, however, inefficient, and thus, practically speaking, useless when a large number of participants are involved.The contribution of this paper is to define a new privacy model -- k-privacy -- by means of an innovative, yet natural generalization of the accepted trusted third party model. This allows implementing cryptographically secure efficient primitives for real-world large-scale distributed systems.As an example for the usefulness of the proposed model, we employ k-privacy to introduce a technique for obtaining knowledge -- by way of an association-rule mining algorithm -- from large-scale Data Grids, while ensuring that the privacy is cryptographically secure.

[bib]

@inproceedings{1014120,
 author = {Bobi Gilburd and Assaf Schuster and Ran Wolff},
 title = {k-TTP: a new privacy model for large-scale distributed environments},
 booktitle = {KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining},
 year = {2004},
 isbn = {1-58113-888-1},
 pages = {563--568},
 location = {Seattle, WA, USA},
 doi = {http://doi.acm.org/10.1145/1014052.1014120},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }



Hiding Sensitive Rules

• Yücel Saygin and Vassilios S. Verykios and Chris Clifton, "Using unknowns to prevent discovery of association rules," ACM SIGMOD Record Volume 30, Issue 4, pp. 45-54, 2001. [link]

[abstract]

Data mining technology has given us new capabilities to identify correlations in large data sets. This introduces risks when the data is to be made public, but the correlations are private. We introduce a method for selectively removing individual values from a database to prevent the discovery of a set of rules, while preserving the data for other applications. The efficacy and complexity of this method are discussed. We also present an experiment showing an example of this methodology.

[bib]

@article{604271,
 author = {Y{\"u}cel Saygin and Vassilios S. Verykios and Chris Clifton},
 title = {Using unknowns to prevent discovery of association rules},
 journal = {SIGMOD Rec.},
 volume = {30},
 number = {4},
 year = {2001},
 issn = {0163-5808},
 pages = {45--54},
 doi = {http://doi.acm.org/10.1145/604264.604271},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }
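
A toy version of the blanking idea in this entry: replace individual values with unknowns ('?') in just enough transactions to push a sensitive itemset's support below the mining threshold, leaving the rest of the data intact. The database, threshold, and victim-selection policy are illustrative; the paper studies these choices carefully.

def support(db, itemset):
    return sum(itemset <= t for t in db) / len(db)

db = [{"bread", "milk"}, {"bread", "milk", "beer"},
      {"bread", "milk"}, {"beer"}, {"bread"}]
sensitive = {"bread", "milk"}
threshold = 0.5                        # initial support is 0.6

for t in db:
    if support(db, sensitive) < threshold:
        break
    if sensitive <= t:
        t.discard("milk")              # blank one supporting value ...
        t.add("?")                     # ... leaving an unknown behind

print(support(db, sensitive))          # now 0.4, below the threshold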

• Ayça Azgin Hintoglu, Ali Inan, Yücel Saygin, Mehmet Keskinöz, "Suppressing Data Sets to Prevent Discovery of Association Rules," ICDM 2005. [link]

[abstract]

This is the abstract ...

[bib]

@inproceedings{DBLP:conf/icdm/HintogluISK05,
  author    = {Ay\c{c}a Azgin Hintoglu and
               Ali Inan and
               Y{\"u}cel Saygin and
               Mehmet Keskin{\"o}z},
  title     = {Suppressing Data Sets to Prevent Discovery of Association
               Rules.},
  booktitle = {ICDM},
  year      = {2005},
  pages     = {645-648},
  ee        = {http://doi.ieeecomputersociety.org/10.1109/ICDM.2005.140},
  crossref  = {DBLP:conf/icdm/2005},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

• Elena Dasseni, Vassilios S. Verykios, Ahmed K. Elmagarmid, and Elisa Bertino, "Hiding Association Rules by Using Confidence and Support," Lecture Notes in Computer Science, Vol. 2137, Proceedings of the 4th International Workshop on Information Hiding, pp. 369-383, 2001. [link]

[abstract]

Large repositories of data contain sensitive information which must be protected against unauthorized access. The protection of the confidentiality of this information has been a long-term goal for the database security research community and the government statistical agencies. Recent advances, in data mining and machine learning algorithms, have increased the disclosure risks one may encounter when releasing data to outside parties.

[bib]

@inproceedings{731877,
 author = {Elena Dasseni and Vassilios S. Verykios and Ahmed K. Elmagarmid and Elisa Bertino},
 title = {Hiding Association Rules by Using Confidence and Support},
 booktitle = {IHW '01: Proceedings of the 4th International Workshop on Information Hiding},
 year = {2001},
 isbn = {3-540-42733-3},
 pages = {369--383},
 publisher = {Springer-Verlag},
 address = {London, UK},
 }

• Stanley R.M. Oliveira, Osmar R. Zaïane, "Privacy Preserving Frequent Itemset Mining," in Proceedings of the IEEE international conference on Privacy, security and data mining - Volume 14, pp. 43-54, 2002. [link]

[abstract]

One crucial aspect of privacy preserving frequent itemset mining is the fact that the mining process deals with a trade-off: privacy and accuracy, which are typically contradictory, and improving one usually incurs a cost in the other. One alternative to address this particular problem is to look for a balance between hiding restrictive patterns and disclosing non-restrictive ones. In this paper, we propose a new framework for enforcing privacy in mining frequent itemsets. We combine, in a single framework, techniques for efficiently hiding restrictive patterns: a transaction retrieval engine relying on an inverted file and Boolean queries; and a set of algorithms to "sanitize" a database. In addition, we introduce performance measures for mining frequent itemsets that quantify the fraction of mining patterns which are preserved after sanitizing a database. We also report the results of a performance evaluation of our research prototype and an analysis of the results.

[bib]

@inproceedings{850789,
 author = {Stanley R. M. Oliveira and Osmar R. Za{\"i}ane},
 title = {Privacy preserving frequent itemset mining},
 booktitle = {CRPITS'14: Proceedings of the IEEE international conference on Privacy, security and data mining},
 year = {2002},
 isbn = {0-909-92592-5},
 pages = {43--54},
 location = {Maebashi City, Japan},
 publisher = {Australian Computer Society, Inc.},
 address = {Darlinghurst, Australia, Australia},
 }

• Vassilios S. Verykios and Ahmed K. Elmagarmid and Elisa Bertino and Yucel Saygin and Elena Dasseni, "Association Rule Hiding," IEEE Transactions on Knowledge and Data Engineering, Volume 16, Issue 4, pp. 434-447, 2004. [link]

[abstract]

Large repositories of data contain sensitive information that must be protected against unauthorized access. The protection of the confidentiality of this information has been a long-term goal for the database security research community and for the government statistical agencies. Recent advances in data mining and machine learning algorithms have increased the disclosure risks that one may encounter when releasing data to outside parties. A key problem, and still not sufficiently investigated, is the need to balance the confidentiality of the disclosed data with the legitimate needs of the data users. Every disclosure limitation method affects, in some way, and modifies true data values and relationships. In this paper, we investigate confidentiality issues of a broad category of rules, the association rules. In particular, we present three strategies and five algorithms for hiding a group of association rules, which is characterized as sensitive. One rule is characterized as sensitive if its disclosure risk is above a certain privacy threshold. Sometimes, sensitive rules should not be disclosed to the public since, among other things, they may be used for inferring sensitive data, or they may provide business competitors with an advantage. We also perform an evaluation study of the hiding algorithms in order to analyze their time complexity and the impact that they have in the original database.

[bib]

@article{972279,
 author = {Vassilios S. Verykios and Ahmed K. Elmagarmid and Elisa Bertino and Yucel Saygin and Elena Dasseni},
 title = {Association Rule Hiding},
 journal = {IEEE Transactions on Knowledge and Data Engineering},
 volume = {16},
 number = {4},
 year = {2004},
 issn = {1041-4347},
 pages = {434--447},
 doi = {http://dx.doi.org/10.1109/TKDE.2004.1269668},
 publisher = {IEEE Educational Activities Department},
 address = {Piscataway, NJ, USA},
 }



Information Theory in Privacy Preserving Data Mining

• Poorvi Vora, "Towards a Theory of Variable Privacy," 2003. [link]

[abstract]

The traditional theory of security, with its focus on perfect secrecy, does not provide a satisfactory framework for the study of situations where information revelation bears a privacy cost and also provides a benefit. We define variable privacy as the use of randomization with user participation in the choice of parameters, and propose the beginnings of a theory for its study. Variable privacy enables the user, or a computational agent working on the user's behalf, to choose a level of interaction, based on a personal cost-benefit analysis of an instance of information revelation.

Our theory is based on treating the randomization protocol as a channel for the information to be protected. We demonstrate a one-to-one correspondence between channel codes and a certain class of attacks. Shannon's theorems show the existence of very special attacks, and upper bounds on their efficiency. We use the bounds to motivate a privacy measure of randomization similar to one proposed in the database literature, and thus provide connections between the theory of security and statistical privacy protection techniques.

We are not aware of any other work that uses Shannon's theorems to construct the special attacks on cryptographic protocols. We are also not aware of any other work that connects error-correcting codes to attacks on randomization.

[bib]

''This is the bibtex entry ... ''
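
A worked example of the channel view sketched in this abstract: model a randomized-response release of a single bit as a binary symmetric channel with flip probability f; the information an attacker can extract per response is then bounded by the channel capacity 1 - h(f). The flip probabilities below are illustrative.

from math import log2

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Randomized response: report the true bit with prob 1 - f, flip with prob f.
for f in (0.0, 0.1, 0.25, 0.5):
    print(f"flip prob {f:.2f}: at most {1 - h2(f):.3f} bits revealed per response")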


Security and Privacy in RFID Systems

• Gildas Avoine, a bibliography on works related to security and privacy in RFID systems. [link]

[abstract]

The goal of this page is to reference works related to security and privacy in RFID systems.

[bib]

''this is the bibtex entry ...''



Privacy Preserving Case Study

• Moni Naor and Benny Pinkas and Reuban Sumner, "Privacy Preserving Auctions and Mechanism Design," Proceedings of the 1st ACM conference on Electronic commerce, pp. 129-139, 1999. [link]

[abstract]

We suggest an architecture for executing protocols for auctions and, in general terms, mechanism design. Our goal is to preserve the privacy of the inputs of the participants (so that no one learns any non-essential information about them, even a posteriori) while maintaining communication and computational efficiency.

We achieve this goal by adding another party - the Auction Issuer - that generates the programs for computing the auctions but does not take an active part in the protocol. The auction issuer is not a trusted party, but rather it is assumed that it does not collude with the auctioneer. For auctions, as long as the auctioneer and the auction issuer do not collude, none of them gains any information about the bids, even after the auction is over. Moreover, bidders can verify that the auction was performed correctly.

The protocols do not require any communication between the bidders and the auction issuer and the computational overhead is very reasonable. This architecture can be used to implement any mechanism design where the significant factor is the complexity of the decision procedure.

[bib]

@inproceedings{337028,
 author = {Moni Naor and Benny Pinkas and Reuban Sumner},
 title = {Privacy preserving auctions and mechanism design},
 booktitle = {EC '99: Proceedings of the 1st ACM conference on Electronic commerce},
 year = {1999},
 isbn = {1-58113-176-3},
 pages = {129--139},
 location = {Denver, Colorado, United States},
 doi = {http://doi.acm.org/10.1145/336992.337028},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }

• Gunther Schadow, Shaun J. Grannis, Clement J. McDonald, "Privacy-Preserving Distributed Queries for a Clinical Case Research Network," appeared at the IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, 2002. [link]

[abstract]

We present the motivation, use-case and requirements of a clinical case research network that would allow biomedical researchers to perform retrospective analysis on de-identified clinical cases joined across a large scale (nationwide) distributed network. Based on semi-join adaptive plans for fusion-queries, in this paper we discuss how joining can be done in a way that protects the privacy of the individual patients involved. Our method is based on a cryptographically strong keyed-hash algorithm (HMAC). These hash values are truncated and the resulting hash-collisions in semi-join filters are exploited to limit the ability of an apprentice-site to re-identify patients in the filter. As a measure of privacy we use likelihood ratios. Since the join key is based on real person identifiers, we need to apply the methods of record linkage to hashing and semi-join filters. We find that multiple disjunctive rules, as used in deterministic matching, lead here to a higher privacy risk than rules based on a single identifier vector.

[bib]

''this is the bibtex entry ...''
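
A minimal sketch of the keyed-hash semi-join filter described above: HMAC the link key with a secret shared among the sites, then truncate the digest so that several patients deliberately collide on each tag; the collisions are what limit re-identification from the filter. The key, identifiers, and truncation length are illustrative.

import hmac, hashlib

KEY = b"shared-network-secret"   # distributed to participating sites only

def join_tag(patient_id: str, tag_bytes: int = 2) -> bytes:
    """Keyed hash of the link key, truncated so that multiple patients
    collide on each tag -- the collisions are the privacy mechanism."""
    return hmac.new(KEY, patient_id.encode(), hashlib.sha256).digest()[:tag_bytes]

# Site A builds a semi-join filter from its case list ...
site_a_cases = ["pat-0001", "pat-0042", "pat-0777"]
filter_tags = {join_tag(p) for p in site_a_cases}

# ... and site B forwards only records whose tag lands in the filter.
site_b_patients = ["pat-0042", "pat-0999", "pat-5555"]
candidates = [p for p in site_b_patients if join_tag(p) in filter_tags]
print(candidates)   # the true match plus any deliberate hash collisions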



Link Farm