A Comprehensive Evaluation of CatBoost and LightGBM Algorithms for Honorarium Prediction on Categorical Datasets with Class Imbalance

Slamet Widodo; Fandy Setyo Utomo; Berlilana

doi:10.30595/juita.v13i3.27363

Authors

Slamet Widodo, Universitas Amikom Purwokerto
Fandy Setyo Utomo, Universitas Amikom Purwokerto
Berlilana, Universitas Amikom Purwokerto

DOI:

https://doi.org/10.30595/juita.v13i3.27363

Keywords:

CatBoost, LightGBM, honorarium, categorical data, imbalanced classes, computational efficiency

Abstract

Income in academic settings, including honoraria, is often determined manually and subjectively, which motivates a predictive model that sets honorarium amounts objectively. Building such a model is challenging because the dataset includes categorical data and has an imbalanced class distribution. This research evaluates the predictive performance and computational resource efficiency of the CatBoost and LightGBM algorithms for honorarium prediction. The dataset comprises 58,332 actual honorarium records of employees at higher education institution "A" in Purwokerto, covering January 2024 to February 2025. The method consists of data preprocessing; stratified splitting of the dataset; modeling with CatBoost, LightGBM, Random Forest, Neural Network, and Linear Regression; and evaluation using MSE, RMSE, MAE, and R², together with computational resource measures (execution time, memory usage, and CPU time). LightGBM achieved an RMSE of 665.960 and an R² of 0.54 and recorded the lowest memory usage, at only 2.67 MB. CatBoost produced an RMSE of 667.395 and an R² of 0.53 and handled categorical features without one-hot encoding. Linear Regression showed the lowest accuracy and high memory usage. These results indicate that, among the models tested, LightGBM is the best choice for fast, efficient, and accurate honorarium prediction. However, this research was limited to testing in a laboratory environment. Further research is recommended to integrate the model directly with an active database, to incorporate information retrieval methods that improve the effectiveness and security of real-time honorarium prediction, and to add interpretability methods such as SHAP to improve decision-making transparency.
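The evaluation described in the abstract combines error metrics (MSE, RMSE, MAE, R²) with resource measurements (execution time, memory, CPU time). The following is a minimal sketch of how such an evaluation could be set up using only the Python standard library. The function names `regression_metrics` and `measure_resources` are hypothetical, the toy values are not the study's data, and the use of `tracemalloc` (which tracks only Python-level allocations) is an assumption, since the abstract does not state how memory was measured.

```python
import math
import time
import tracemalloc


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


def measure_resources(fn, *args, **kwargs):
    """Run fn and report wall-clock time, CPU time and peak traced memory."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu_s = time.process_time() - cpu_start
    wall_s = time.perf_counter() - wall_start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    usage = {"wall_s": wall_s, "cpu_s": cpu_s, "peak_mb": peak_bytes / 1024**2}
    return result, usage


# Toy values for illustration only (assumed, not taken from the study).
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
metrics, usage = measure_resources(regression_metrics, y_true, y_pred)
print(metrics)
print(usage)
```

In a real comparison, the callable passed to `measure_resources` would be each model's training or prediction step, so that every algorithm is timed and profiled under the same conditions.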
