A Comparative Study of K-Means and KNN Imputation for Handling Missing Data in Scholarship Applicant Datasets

Muhammad Muhammad; Tole Sutikno; Imam Riadi

doi:10.30595/juita.v13i3.26502

Authors

Muhammad Muhammad, Department of Informatics, Universitas Ahmad Dahlan
Tole Sutikno, Department of Electrical Engineering, Universitas Ahmad Dahlan
Imam Riadi, Department of Information System, Universitas Ahmad Dahlan

Keywords:

imputation, K-Means, KNN, missing values

Abstract

Handling missing values is a key issue in data processing, especially in the financial records of prospective scholarship recipients, where precision is vital for effective decision making. This research analyzes the effectiveness of two commonly used imputation methods, K-Nearest Neighbors (KNN) and K-Means, in filling missing values across key attributes: semester, Grade Point Average (GPA), number of dependents, number of credits, and parental income. Performance was evaluated using Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE). The results indicate that KNN generally provides more stable and accurate imputations, particularly for attributes with homogeneous distributions such as semester and GPA, while K-Means performs competitively on attributes with higher variability, provided that the number of clusters is chosen appropriately. However, K-Means tends to be more sensitive to increasing proportions of missing data. These findings, observed in scenarios with 15% and 25% missing data, underscore the importance of selecting an imputation method that matches both the distribution of each attribute and the extent of missing data when developing reliable predictive models. They can also serve as a reference for developing more accurate scholarship selection processes in the presence of incomplete financial data.
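
The comparison described in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the toy records, the column layout (semester, credits, GPA), the masked rows, the choices of k, and the helper names `knn_impute` and `kmeans_impute` are all assumptions made for illustration. Known GPA values are hidden, imputed by each method, and then scored against the hidden truth with RMSE and MAPE.

```python
import math
import random

def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent; actual values must be non-zero."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

def sq_dist(row, point, cols):
    """Squared Euclidean distance between row[cols] and a point given in cols order."""
    return sum((row[c] - point[i]) ** 2 for i, c in enumerate(cols))

def knn_impute(rows, target, k=2):
    """Fill None in column `target` with the mean of the k nearest complete rows."""
    cols = [c for c in range(len(rows[0])) if c != target]
    complete = [r for r in rows if r[target] is not None]
    filled = [list(r) for r in rows]
    for r in filled:
        if r[target] is None:
            point = [r[c] for c in cols]
            nearest = sorted(complete, key=lambda d: sq_dist(d, point, cols))[:k]
            r[target] = sum(d[target] for d in nearest) / len(nearest)
    return filled

def kmeans_impute(rows, target, k=2, iters=20, seed=0):
    """Cluster complete rows on the other columns, then fill None in `target`
    with the mean target value of the nearest cluster."""
    cols = [c for c in range(len(rows[0])) if c != target]
    complete = [r for r in rows if r[target] is not None]
    overall = sum(r[target] for r in complete) / len(complete)
    rng = random.Random(seed)
    centers = [[r[c] for c in cols] for r in rng.sample(complete, k)]

    def nearest_center(r):
        return min(range(k), key=lambda j: sq_dist(r, centers[j], cols))

    def assign():
        groups = [[] for _ in range(k)]
        for r in complete:
            groups[nearest_center(r)].append(r)
        return groups

    for _ in range(iters):
        groups = assign()
        # Keep the old center if a cluster ends up empty.
        centers = [
            [sum(r[c] for r in g) / len(g) for c in cols] if g else centers[j]
            for j, g in enumerate(groups)
        ]
    groups = assign()
    # Fall back to the overall mean for an empty cluster.
    fill = [sum(r[target] for r in g) / len(g) if g else overall for g in groups]
    filled = [list(r) for r in rows]
    for r in filled:
        if r[target] is None:
            r[target] = fill[nearest_center(r)]
    return filled

# Hypothetical toy records (not from the paper): [semester, credits, gpa]
full = [
    [2, 40, 3.2], [2, 42, 3.4], [4, 80, 3.0],
    [4, 84, 3.1], [6, 120, 3.6], [6, 124, 3.5],
]
TARGET = 2            # impute the GPA column
masked_idx = [1, 4]   # rows whose GPA is hidden to simulate missing data
masked = [list(r) for r in full]
for i in masked_idx:
    masked[i][TARGET] = None
truth = [full[i][TARGET] for i in masked_idx]

knn_filled = knn_impute(masked, TARGET, k=2)
km_filled = kmeans_impute(masked, TARGET, k=2)
for name, filled in (("KNN", knn_filled), ("K-Means", km_filled)):
    pred = [filled[i][TARGET] for i in masked_idx]
    print(name, "RMSE:", round(rmse(truth, pred), 4), "MAPE (%):", round(mape(truth, pred), 2))
```

The sketch computes distances on raw, unscaled features to stay short. On real data with mixed scales, such as parental income alongside GPA, the features would normally be standardized first, and the K-Means result depends on the number of clusters and on the random initialization.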
