Comparison of Data Mining Classification Algorithms for Stroke Disease Prediction Using the SMOTE Upsampling Method

Ronald Sebastian, Christina Juliane

Abstract


Stroke is a disorder of blood circulation in the brain that produces symptoms and signs corresponding to the affected brain region, and it is the leading cause of death and disability in Indonesia. Everyone is at risk of stroke, so recognizing and managing risk factors is important. Data mining techniques can help extract information, make predictions, and uncover hidden patterns in stroke medical data. The dataset used in this research comes from Kaggle and is imbalanced, so the SMOTE Upsampling technique is applied to address the imbalance. The study concludes that applying SMOTE with the C4.5, Naïve Bayes (NB), and K-Nearest Neighbour (KNN) algorithms can increase precision, recall, and AUC. C4.5 with SMOTE, the best-performing combination, was selected for testing on new data, and the results show that the resulting model predicts stroke risk more accurately than the C4.5 model without SMOTE. However, based on the author's interview with a medical practitioner, the model cannot be used directly in medical practice, because determining the factors related to stroke in a clinical setting is highly complex. Data mining can serve as an initial-stage predictive tool for the general population, but direct examination by doctors in a hospital is strongly recommended to obtain a more accurate and comprehensive medical evaluation.
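
The core idea behind SMOTE can be illustrated with a minimal sketch. It is not the implementation used in the study, whose tooling the abstract does not specify. It assumes numeric features, Euclidean distance, and a hypothetical two-feature minority class (age and average glucose level, with made-up values). Each synthetic record is placed at a random point on the line segment between a minority record and one of its k nearest minority neighbours.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical (age, avg_glucose_level) rows for the minority "stroke" class.
minority = [(67.0, 228.7), (80.0, 105.9), (74.0, 170.0), (61.0, 202.2)]
majority_count = 12  # assumed size of the majority "no stroke" class

# Generate enough synthetic rows to match the majority class size.
new_points = smote(minority, n_new=majority_count - len(minority), k=3)
balanced_minority = minority + new_points
print(len(balanced_minority))
```

In a comparison such as the one described here, oversampling of this kind is normally applied to the training portion only, so that precision, recall, and AUC are measured on records the classifier has not seen in either original or synthetic form.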


Keywords


SMOTE Upsampling, K-Nearest Neighbour, Naïve Bayes, C4.5, Stroke


DOI: 10.30595/juita.v11i2.17348

This work is licensed under a Creative Commons Attribution 4.0 International License.

ISSN: 2579-8901