Paper title:

MR-Tree - A Scalable MapReduce Algorithm for Building Decision Trees

Published in: Issue 1 (Vol. 8) / 2014
Publishing date: 2014-03-25
Pages: 16-19
Author(s): PURDILĂ Vasile, PENTIUC Ştefan-Gh.
Abstract. Learning decision trees from very large amounts of data is not practical on single-node computers because of the huge number of calculations the process requires. Apache Hadoop is a large-scale distributed computing platform that runs on clusters of commodity hardware and can be used successfully for data mining tasks on very large datasets. This work presents a parallel decision tree learning algorithm, expressed in the MapReduce programming model, that runs on the Apache Hadoop platform and scales very well with dataset size.
Keywords: Big Data, Decision Tree, Hadoop, MapReduce, Pattern Recognition.
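
To make the idea of expressing decision tree learning in MapReduce concrete, the following is a minimal single-process sketch in Python. It is not the MR-Tree algorithm from the paper, whose details are not given in the abstract. It assumes categorical attributes and an information-gain split criterion, and the names map_record, reduce_counts and best_split are hypothetical. The map step emits a count for every (attribute, value, class) triple, the reduce step sums those counts, and the summed counts are enough to choose the split for one tree node.

```python
import math
from collections import Counter, defaultdict


def map_record(record, label):
    """Map step (hypothetical): emit ((attribute, value, class), 1) per attribute."""
    for attribute, value in record.items():
        yield (attribute, value, label), 1


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the counts emitted for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def entropy(class_counts):
    """Shannon entropy, in bits, of a collection of class counts."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)


def best_split(totals):
    """Choose the attribute with the highest information gain (assumed criterion)."""
    # Regroup the reduced counts as attribute -> value -> class -> count.
    by_attr = defaultdict(lambda: defaultdict(Counter))
    for (attribute, value, label), count in totals.items():
        by_attr[attribute][value][label] += count

    gains = {}
    for attribute, values in by_attr.items():
        class_totals = Counter()
        for counts in values.values():
            class_totals.update(counts)
        n = sum(class_totals.values())
        # Weighted entropy of the partitions induced by this attribute.
        remainder = sum(
            sum(counts.values()) / n * entropy(counts.values())
            for counts in values.values()
        )
        gains[attribute] = entropy(class_totals.values()) - remainder
    return max(gains, key=gains.get), gains


# Toy dataset, invented purely for illustration.
data = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]

# On a cluster the map calls would run in parallel over input splits;
# here they are simply chained in one process.
pairs = (pair for record, label in data for pair in map_record(record, label))
best, gains = best_split(reduce_counts(pairs))
print(best, gains)
```

In this sketch only the small table of counts has to reach the process that picks the split, while the records themselves stay with the mappers. That is one common reason this kind of computation can scale with dataset size, although the paper's own design may differ.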

This article is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Copyright JACSM 2007-2024