Resampling Technique for Imbalanced Class Handling on Educational Dataset

Anief Fauzan Rozi, Adi Wibowo, Budi Warsito

Abstract


Educational data mining is an emerging field within data mining. Accurately identifying student accomplishment in a current or upcoming course can help institutions build better technology-aided education. Educational data mining is becoming an increasingly important field of study because of its potential to produce knowledge-base models that support teachers and lecturers. Like other classification tasks, educational data mining has a common and frequently encountered problem: the imbalanced class problem. An imbalanced class is a condition in which the classes are not distributed in equal proportions. In this research, the class distribution is found to be severely imbalanced, and the dataset is multiclass, consisting of more than two class labels. Accordingly, this paper focuses on imbalanced class handling and classification using several methods for both tasks: Linear Regression, Random Forest, and Stacking for classification, and SMOTE, ADASYN, and SMOTE-ENN as resampling algorithms. The methods are evaluated using 10-fold cross-validation and an 80-20 splitting ratio. The results show that the best performance comes from Stacking classification on the ADASYN-resampled dataset evaluated with an 80-20 splitting ratio, achieving a 0.97 F1 score. The results of this study also show that the resampling techniques improve classification performance. Although classification without resampling also produced decent results, this may be because the general pattern of the data for each class was already distinct from the start; thus, there are no real drawbacks to processing the original data.
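The core idea behind the SMOTE family of resamplers mentioned above is to generate synthetic minority-class samples by interpolating between a minority sample and one of its k nearest minority neighbours. The following is a minimal numpy-only sketch of that interpolation step (not the library implementation from imbalanced-learn, and the function name and toy data are illustrative only); ADASYN follows the same scheme but biases generation toward samples that are harder to learn, and SMOTE-ENN additionally cleans the result with Edited Nearest Neighbours.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each base sample and one of its k nearest minority-class neighbours
    (the core SMOTE idea; library versions add further safeguards)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                      # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]                # k nearest neighbours per sample
    base = rng.integers(0, n, size=n_new)            # pick a base sample
    nbr = nn[base, rng.integers(0, k, size=n_new)]   # and one of its neighbours
    gap = rng.random((n_new, 1))                     # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# toy minority class: 4 points in 2-D; generate 6 synthetic samples
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_oversample(X_min, n_new=6, rng=0)
print(X_syn.shape)  # (6, 2)
```

Because each synthetic point is a convex combination of two real minority samples, the new samples stay inside the region occupied by the minority class rather than being drawn at random.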

Keywords


Imbalanced class; ADASYN; SMOTE; SMOTE-ENN; educational data mining




DOI: 10.30595/juita.v11i1.15498



This work is licensed under a Creative Commons Attribution 4.0 International License.

ISSN: 2579-8901