MRMondrian: Scalable Multidimensional Anonymisation for Big Data Privacy Preservation
Abstract:
Scalable data processing platforms built on cloud computing becomes increasingly attractive as infrastructure for supporting big data applications. But privacy concerns are one of the major obstacles to making use of public cloud platforms. Multidimensional anonymisation, a global-recoding generalisation scheme for privacy-preserving data publishing, has been a recent focus due to its capability of balancing data obfuscation and usability. Existing multidimensional anonymisation methods suffer from scalability problems when handling big data due to the impractical serial I/O cost. Given the recursive feature of multidimensional anonymisation, parallelisation is an ideal solution to scalability issues. However, it is still a challenge to use existing distributed and parallel paradigms directly for recursive computation. In this paper, we propose a scalable approach for big data multidimensional anonymisation based on MapReduce, a state-of-the-art data processing paradigm. Our basic idea is to partition a data set recursively into smaller partitions using MapReduce until all partitions can fit in the memory of a computing node. A tree indexing structure is proposed to achieve recursive computation. Moreover, we show the applicability of our approach to differential privacy. Experimental results on real-life data demonstrate that our approach can significantly improve the scalability of multidimensional anonymisation over existing methods.
Existing System:
Existing multidimensional anonymisation methods suffer from scalability problems when handling big data due to the impractical serial I/O cost. Given the recursive feature of multidimensional anonymisation, parallelisation is an ideal solution to scalability issues. However, it is still a challenge to use existing distributed and parallel paradigms directly for recursive computation. In this paper, we propose a scalable approach for big data multidimensional anonymisation based on MapReduce, a state-of-the-art data processing paradigm. Our basic idea is to partition a data set recursively into smaller partitions using MapReduce until all partitions can fit in the memory of a computing node. A tree indexing structure is proposed to achieve recursive computation. Moreover, we show the applicability of our approach to differential privacy. Experimental results on real-life data demonstrate that our approach can significantly improve the scalability of multidimensional anonymisation over existing methods.
Proposed System:
we propose a scalable multidimensional anonymisation approach for big data sets using MapReduce , a state-of-the-art big data processing paradigm. The approach is named MRMondrian for reference, following the naming of Mondrian algorithms. Our intuitive idea is to partition a large data set recursively into smaller data partitions using MapReduce until all partitions can fit in the memory of each computing node. Then, traditional Mondrian algorithms can be executed on each single node in a parallel fashion. To avoid launching multiple MapReduce jobs recursively for data partitioning, we propose to accomplish data partitioning in a single MapReduce job. A Partition ID indexing tree (PID-tree) data structure is developed to support recursive computation on MapReduce. We leverage coefficient of variation as the heuristics to select the splitting attribute for categorical attributes, and design a scalable median-finding algorithm for numerical attributes based on the idea of the median of medians and the histogram technique. MRMondrian is further extended to be applicable to the differential privacy model. Extensive experiments on real-life data demonstrate that our approach can significantly improve the scalability of multidimensional anonymisation.
Conclusion:
we have proposed a highly scalable and efficient approach named MRMondrian for multidimensional anonymisation over big data based on the MapReduce paradigm. Following the basic idea of dividing data sets into small data partitions to make them fit into the main memory of a single node, we have proposed to conduct iterative data partitioning in a parallel manner. Concrete data partitioning is performed when all data partitions can fit into main memory. Then, each data partition is further divided recursively by the traditional serial Mondrian method on a single node. To support iterative data partitioning, a tree structure named PID-tree has been proposed to index data partitions for searching partition IDs. Coefficient of variation has been leveraged to select splitting attributes for a categorical attribute, while a scalable method based on the idea of the median of medians and the histogram technique has been proposed to find the median for a numerical attribute. We have also extended our approach to differential privacy. Experimental results from real-world data sets have shown that our approach can significantly improve the scalability and time-efficiency of the multidimensional scheme over existing approaches. In the future, we plan to integrate our approach with scalable data mining platforms to achieve scalable privacy-preserving big data mining.
REFERENCES
[1] X. Zhang, L. Qi, Q. He, and W. Dou, “Scalable iterative implementation of mondrian for big data multidimensional anonymisation,” in DependSys, 2016, pp. 311–320.
[2] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, “Data mining with big data,” IEEE TKDE, vol. 26, no. 1, pp. 97–107, 2014.
[3] W. Fan and A. Bifet, “Mining big data: current status, and forecast to the future,” ACM SIGKDD Explorations Newsletter, vol. 14, no. 2, pp. 1–5, 2013.
[4] J. Lin and D. Ryaboy, “Scaling big data mining infrastructure: the twitter experience,” ACM SIGKDD Explorations Newsletter, vol. 14, no. 2, pp. 6–19, 2013.
[5] S. Chaudhuri, “What next?: A half-dozen data management research goals for big data and the cloud,” in Proc. PODS’12, 2012, pp. 1–4.
[6] B. Fung, K. Wang, R. Chen, and P. S. Yu, “Privacy-preserving data publishing: A survey of recent developments,” ACM Computing Surveys, vol. 42, no. 4, p. 14, 2010.
[7] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, “Workload-aware anonymization techniques for large-scale datasets,” ACM TODS, vol. 33, no. 3, p. 17, 2008.
[8] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W.-C. Fu, “Utilitybased anonymization using local recoding,” in Proc. SIGKDD’06, 2006, pp. 785–790.
[9] B. C. Fung, K. Wang, and P. S. Yu, “Anonymizing classification data for privacy preservation,” IEEE TKDE, vol. 19, no. 5, pp. 711– 725, 2007.
[10] X. Zhang, L. T. Yang, C. Liu, and J. Chen, “A scalable two-phase top-down specialization approach for data anonymization using mapreduce on cloud,” IEEE TPDS, vol. 25, no. 2, pp. 363–373, 2014.
[11] X. Zhang, W. Dou, J. Pei, S. Nepal, C. Yang, C. Liu, and J. Chen, “Proximity-aware local-recoding anonymization with mapreduce for scalable big data privacy preservation in cloud,” IEEE Trans. Computers, vol. PP, no. 99, 2014.
[12] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, “Mondrian multidimensional k-anonymity,” in Proc. ICDE’06, 2006, pp. 25–25.