Privacy preserving big data mining: association rule hiding using fuzzy logic approach

Abstract

Privacy preserving data mining has recently been studied widely. Association rule mining can pose a threat to data privacy, so association rule hiding techniques are employed to avoid the risk of sensitive knowledge leakage. Much research has been done on association rule hiding, but most of it focuses on proposing algorithms with the least side effects for static databases (with no new data entering), whereas the authors now face streaming data, which arrive continuously. Furthermore, in the age of big data, existing methods must be optimised so that they can be executed on large volumes of data. In this study, data anonymisation is used to fit the proposed model for big data mining. In addition, special features of big data such as velocity make it necessary to treat each rule as a sensitive association rule with an appropriate membership degree. Finally, the parallelisation techniques embedded in the proposed model can help to speed up the data mining process.

Existing System

By definition, big data refers to a high volume of structured, semi-structured and unstructured data arriving at high velocity, which can be mined for information. Big data mining refers to the capability of extracting information from massive datasets that, owing to their specific features, cannot be processed using existing data mining techniques. In many situations, it is infeasible to store this huge amount of data, so knowledge extraction must be done in real time. Processing big data requires a cluster of computers with high computing performance, and such a framework becomes practical with parallel programming models such as MapReduce. Owing to the novel features of big data, such as high volume and variety in data structures, the techniques mentioned need essential updates to satisfy the related requirements. In this model, the generalisation technique is used for anonymity, because the suppression technique is not suitable for quantity data and the randomisation technique imposes significant overhead on systems.
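As an illustration of generalisation applied to quantity data, the following minimal sketch replaces each exact quantity with the interval that contains it. The function name `generalise_quantity`, the bucket width of 10 and the sample transactions are assumptions made for this example; the source does not specify how generalisation levels are chosen.

```python
def generalise_quantity(value, width=10):
    """Replace an exact quantity with the half-open interval [low, high) containing it.

    `width` is a hypothetical bucket size chosen only for this illustration.
    """
    low = (value // width) * width
    return (low, low + width)


# Hypothetical transactions: item -> purchased quantity.
transactions = [
    {"bread": 3, "milk": 12},
    {"bread": 7, "milk": 18},
]

# After generalisation the exact quantities are no longer visible,
# but each item is still present, so no item is suppressed.
anonymised = [
    {item: generalise_quantity(qty) for item, qty in t.items()}
    for t in transactions
]

for original, hidden in zip(transactions, anonymised):
    print(original, "->", hidden)
```

In this sketch both transactions map to the same generalised record, which is the effect generalisation aims for: individual values become indistinguishable while the items themselves stay in the data.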

Proposed System 

In this research, to hide sensitive association rules in big data mining, anonymisation methods are used instead of removing repeated instances of sensitive association rules. In addition to hiding sensitive information, this removes the undesired side effect that deleting frequent item-sets (ISs) has on newly arriving data. To make the approach suitable for big data analysis, parallelisation and scalability are also considered. The sensitivity degree of each association rule is determined using appropriate membership functions, and anonymisation is performed according to that degree. Although much research has been done on association rule hiding, most existing approaches have significant drawbacks:
• A Boolean logic (versus fuzzy logic) approach to deciding whether an association rule is sensitive or not.
• Undesired side effects of hiding sensitive association rules on non-sensitive rules.
• Unsuitability for big data analysis.
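The contrast between a Boolean and a fuzzy notion of sensitivity can be sketched as follows. This is a minimal illustration, not the model's actual definition: the source does not state which membership functions are used, so the ramp shape, the use of rule confidence as its input, the thresholds `low` and `high`, and the mapping from degree to generalisation width are all assumptions.

```python
def sensitivity_degree(confidence, low=0.4, high=0.8):
    """Ramp membership function (assumed shape): 0 at or below `low`,
    1 at or above `high`, and linear in between."""
    if confidence <= low:
        return 0.0
    if confidence >= high:
        return 1.0
    return (confidence - low) / (high - low)


def generalisation_width(degree, base_width=5, max_width=50):
    """Map a membership degree in [0, 1] to a bucket width (hypothetical rule):
    the more sensitive the rule, the coarser the generalisation."""
    return base_width + round(degree * (max_width - base_width))


# Every rule gets a degree instead of a yes/no sensitivity label.
for rule_confidence in (0.3, 0.6, 0.9):
    degree = sensitivity_degree(rule_confidence)
    print(rule_confidence, degree, generalisation_width(degree))
```

Under these assumptions a rule is never simply "sensitive" or "not sensitive"; the degree controls how strongly the supporting data are anonymised, which is one way to limit the effect on rules with low sensitivity.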

Conclusion

Association rule mining, besides its benefits in discovering non-obvious relationships in data, can result in privacy violations. Association rule hiding can help to protect sensitive association rules from being discovered. Many different techniques have been proposed to hide particular association rules, but most of them select ISs to modify so that the confidence value drops below the defined threshold. None of the existing approaches can be executed in a parallel and scalable manner, as big data mining requires. Moreover, removing ISs from the database can cause serious information loss as new data streams arrive. In this research, a new association rule hiding technique for big data is presented, which uses a fuzzy logic approach and tries to decrease the undesired side effects of sensitive rule hiding on non-sensitive rules in data streams. Features such as parallelism and scalability are embedded in the proposed model so that it can be implemented for huge volumes of data. Results show that the proposed model can be more effective in big data mining than existing rule hiding approaches. As future work, we will try to decrease the undesired side effects of the proposed model to reduce information loss.
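For reference, the confidence-lowering strategy used by most existing techniques can be sketched as below. The example uses the standard definition of confidence (the number of transactions containing both the antecedent and the consequent, divided by the number containing the antecedent); the sample transactions, the threshold `min_confidence` and the choice of which transaction to modify are assumptions made for illustration.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)


# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread"},
    {"eggs"},
]
min_confidence = 0.7  # assumed mining threshold

before = confidence(transactions, {"bread"}, {"milk"})

# Item-removal hiding: drop the consequent from one supporting transaction.
sanitised = [set(t) for t in transactions]
sanitised[0].discard("milk")
after = confidence(sanitised, {"bread"}, {"milk"})

print(before >= min_confidence, after >= min_confidence)
```

The rule bread -> milk is no longer reported after the modification, but an item has been deleted from the data, which is the information loss that the anonymisation-based approach tries to avoid.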
