On Distributed Fuzzy Decision Trees for Big Data
Abstract:
Fuzzy decision trees (FDTs) have been shown to be an effective solution in the framework of fuzzy classification. The approaches to FDT learning proposed so far, however, have generally neglected time and space requirements. In this paper, we propose a distributed FDT learning scheme shaped according to the MapReduce programming model for generating both binary and multi-way FDTs from big data. The scheme relies on a novel distributed fuzzy discretizer that generates a strong fuzzy partition for each continuous attribute based on fuzzy information entropy. The fuzzy partitions are then used as input to the FDT learning algorithm, which employs fuzzy information gain for selecting the attributes at the decision nodes. We have implemented the FDT learning scheme on the Apache Spark framework. We have used ten real-world, publicly available big datasets to evaluate the behavior of the scheme along three dimensions: i) performance in terms of classification accuracy, model complexity, and execution time; ii) scalability with a varying number of computing units; and iii) ability to efficiently accommodate an increasing dataset size. We have demonstrated that the proposed scheme is suitable for managing big datasets even with modest commodity hardware. Finally, for comparative analysis, we have used the distributed decision tree learning algorithm implemented in the MLlib library and the Chi-FRBCS-BigData algorithm, a MapReduce distributed fuzzy rule-based classification system.
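To make the notion of a strong fuzzy partition concrete, the sketch below builds triangular membership functions over a set of cores (one core per fuzzy set); the function name `memberships`, the triangular shapes, and the example cores are illustrative assumptions rather than the paper's exact construction. The defining property of a strong partition is that the membership degrees of all fuzzy sets sum to 1 at every point of the attribute domain.

```python
def memberships(x, cores):
    """Membership degrees of x in a strong triangular fuzzy partition.

    cores: sorted list of core points; fuzzy set j peaks (degree 1) at
    cores[j] and decreases linearly toward the adjacent cores. The first
    and last sets have shoulders extending beyond the outermost cores.
    By construction the returned degrees always sum to 1.
    """
    mus = [0.0] * len(cores)
    if x <= cores[0]:
        mus[0] = 1.0          # left shoulder of the first fuzzy set
    elif x >= cores[-1]:
        mus[-1] = 1.0         # right shoulder of the last fuzzy set
    else:
        for j in range(len(cores) - 1):
            lo, hi = cores[j], cores[j + 1]
            if lo <= x <= hi:
                t = (x - lo) / (hi - lo)
                mus[j] = 1.0 - t      # degree in the left fuzzy set
                mus[j + 1] = t        # degree in the right fuzzy set
                break
    return mus
```

For example, with cores `[0, 1, 2, 3]` a value of 1.5 belongs with degree 0.5 to the second and third fuzzy sets, and with degree 0 to the others.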
Existing System:
A decision tree can be generated efficiently from very large datasets. The various techniques proposed in the literature can be roughly grouped into two categories, characterized by performing a pre-sorting of the data or by adopting approximate representations of the data, such as samples and/or histograms [46]. While pre-sorting techniques are more accurate, they cannot accommodate very large datasets or streaming data [46]. One of the oldest approaches in the first category is SLIQ, proposed in [47]. SLIQ reduces decision tree learning time without loss of accuracy by exploiting a pre-sorting technique in the tree-growth phase. This technique is integrated with a breadth-first tree-growing strategy to enable classification of disk-resident datasets.
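SLIQ's pre-sorting step can be pictured as building, once before tree growth, one sorted attribute list per attribute plus a class list indexed by record id. The sketch below is a simplified in-memory illustration under that assumption; the actual SLIQ implementation keeps the attribute lists on disk and only the class list memory-resident, which is what makes disk-resident datasets tractable.

```python
def build_attribute_lists(records):
    """Pre-sort each attribute once, SLIQ-style.

    records: list of (features_dict, class_label) pairs.
    Returns:
      attr_lists: {attribute: [(value, record_id), ...] sorted by value}
      class_list: class labels indexed by record_id
    Each attribute list is sorted a single time up front, so no re-sorting
    is needed when evaluating candidate split points during tree growth.
    """
    class_list = [label for _, label in records]
    attr_lists = {}
    for attr in records[0][0]:
        attr_lists[attr] = sorted(
            (features[attr], rid) for rid, (features, _) in enumerate(records)
        )
    return attr_lists, class_list
```

During tree growth, a sequential scan of a sorted attribute list visits candidate split points in increasing order of value, while the class list maps each record id back to its label for impurity computations.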
Proposed System:
We propose a distributed fuzzy discretizer and a distributed FDT (DFDT) learning scheme based on the MapReduce programming model for managing big data. To the best of our knowledge, in the context of big data, no distributed discretizer for generating fuzzy partitions and no DFDT have been proposed in the literature. Our novel discretizer generates a strong fuzzy partition for each continuous attribute by using a purposely adapted distributed version of the well-known method proposed by Fayyad and Irani in [44]. The fuzzy partitions computed by the discretizer are used as input to the DFDT learning algorithm. We adopt and compare two versions of the learning algorithm, based on binary and multi-way splits, respectively. Both versions employ the information gain, computed in terms of fuzzy entropy, to select the attribute to be used at each decision node.
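The attribute-selection criterion can be sketched as follows: fuzzy entropy replaces crisp class counts with membership-weighted class cardinalities, and the fuzzy information gain of a candidate attribute discounts the node's entropy by the cardinality-weighted entropies of the fuzzy child nodes. The code below is a single-machine sketch, not the paper's distributed implementation, and it assumes the product t-norm for combining the node membership with each fuzzy set's membership; the function names are illustrative.

```python
import math

def fuzzy_entropy(weights, labels):
    """Fuzzy entropy of a node: class proportions are computed from
    membership-weighted cardinalities rather than crisp counts."""
    total = sum(weights)
    if total == 0:
        return 0.0
    card = {}
    for w, y in zip(weights, labels):
        card[y] = card.get(y, 0.0) + w
    return -sum((c / total) * math.log2(c / total)
                for c in card.values() if c > 0)

def fuzzy_info_gain(weights, labels, memberships):
    """Fuzzy information gain of a candidate attribute at a node.

    weights[i]: membership degree of example i in the current node.
    memberships[j][i]: degree of example i in fuzzy set j of the attribute
    (product t-norm assumed to combine it with the node membership).
    """
    total = sum(weights)
    gain = fuzzy_entropy(weights, labels)
    for mu_j in memberships:
        child = [w * m for w, m in zip(weights, mu_j)]
        gain -= (sum(child) / total) * fuzzy_entropy(child, labels)
    return gain
```

With crisp (0/1) memberships this reduces to the classical information gain, which is a useful sanity check: a two-set split that perfectly separates two balanced classes yields a gain of 1 bit.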
Conclusions:
We have proposed a distributed fuzzy decision tree (FDT) learning scheme shaped according to the MapReduce programming model for generating both binary (FBDT) and multi-way (FMDT) FDTs from big data. We first introduced a novel distributed fuzzy discretizer, which generates a strong fuzzy partition for each continuous attribute based on fuzzy information entropy. We then discussed a distributed implementation of an FDT learning algorithm, which employs the fuzzy information gain for selecting the attributes to be used at the decision nodes. We have implemented the FDT learning scheme on the Apache Spark framework. We believe that the work presented in this paper is the first extensive study on the application of FDTs to big data that considers both binary and multi-way splits. We expect that the experimental results can serve as a baseline for future research in this field.