Comparison of Data Mining Classification Algorithms for Stroke Disease Prediction Using the SMOTE Upsampling Method

Stroke is a circulation disorder in the brain that causes symptoms and signs related to the affected part of the brain, and it is the leading cause of death and disability in Indonesia. Everyone is at risk of experiencing a stroke, so it is important to recognize and manage risk factors. Data mining techniques can help extract and predict information and find hidden patterns in stroke medical data. The dataset used in this research comes from Kaggle and is imbalanced, so the SMOTE Upsampling technique is used to address the imbalance. The results show that applying SMOTE to the C4.5, NB, and KNN algorithms increases precision, recall, and AUC. C4.5 with SMOTE, the best-performing combination, was selected for testing on new data, and the results show that this model predicts stroke risk more accurately than the C4.5 model without SMOTE. However, based on the author's interview with a medical practitioner, the model cannot be used directly in medical practice because the clinical observations needed to determine stroke-related factors are highly complex, which makes stroke prediction in a practical setting difficult. While data mining can serve as an initial-stage predictive tool for the general population, direct examination by doctors in a hospital is strongly recommended to obtain more accurate and comprehensive medical evaluations.


I. INTRODUCTION
Stroke is the leading cause of death and disability in Indonesia. Everyone, regardless of age, is at risk of experiencing a stroke [1]. According to the World Health Organization (WHO) Task Force in Stroke and other Cerebrovascular Disease, stroke is an acute neurological dysfunction caused by an abnormality in blood circulation that develops rapidly (within seconds, or within hours at the latest), with symptoms and signs related to the specific part of the brain affected [2]. Meanwhile, Riskesdas (2018) defines stroke as sudden, gradual, and rapid brain damage due to non-traumatic disturbances in brain blood circulation. This condition causes rapid symptoms, such as facial or limb paralysis, unclear or slurred speech, changes in consciousness, visual disturbances, and others. Globally, stroke has become more common in the last 20 years, as reported by Mukherjee in the article "Dominant Risk Factors for Stroke Patients in Indonesia" [3]. WHO estimates that by 2025, the annual number of Europeans affected by stroke will increase from 1.1 million in 2000 to 1.5 million. Indonesia is no exception: Riskesdas found in 2018 that 10.9% of all deaths in Indonesia were due to stroke.
Stroke incidence continues to increase in Indonesia, and the risk factors for this disease must be recognized and managed as early as possible to prevent further damage and death. Unfortunately, the number of specialist doctors in Indonesia is still limited, and only 20% of Indonesians know the signs and symptoms, so many people wait too long before bringing stroke patients to the hospital. According to an article by Dr. Nanda L Prasetya (2020) cited on the website https://sippn.menpan.go.id/, if patients receive proper care, the effects of mild strokes can usually be managed in less than 10 minutes, and 90% can be reduced within less than four hours. Based on the above explanation, knowing and understanding the factors that cause stroke supports effective preventive measures against future strokes [3].
The medical industry urgently needs a reliable and fast automated computerized system to provide a diagnosis of the causes and patterns of stroke. This is why it is crucial to maintain a record of data for each patient. The collected data can be used as a source to predict the likelihood of stroke in the future. Therefore, Data Mining techniques play a crucial role in extracting and predicting information and discovering hidden patterns in stroke medical data.
Data Mining is a method that utilizes statistics, mathematics, artificial intelligence, and machine learning techniques to extract and uncover useful information and knowledge from large databases. The practice of Data Mining refers to a set of processes used to extract previously unknown knowledge from a dataset [4]. The Data Mining methods that the writer will use are the Decision Tree (C4.5) algorithm, the Naïve Bayes (NB) algorithm, and the K-Nearest Neighbor (KNN) algorithm.
The writer chose KNN, C4.5, and NB because these methods have reliable advantages in classifying data. As explained by [5] in the journal "Comparing Different Supervised Machine Learning Algorithms for Disease Prediction" published by the National Library of Medicine, Naive Bayes can handle discrete and continuous data, can make probabilistic predictions, and requires less training data. KNN can classify instance data quickly and can handle instance data with noise or missing attribute values. Meanwhile, C4.5 produces a classification tree that is easier to understand and interpret, and it supports multiple data types such as numeric, nominal, and categorical.
Imbalanced class distribution is a common problem in medical data [6]. It occurs when the number of records in the majority and minority classes differs greatly, which can cause classification errors. In the case of stroke patient data, an imbalance between classes can cause misdiagnosis and inappropriate treatment. Therefore, it is necessary to study how to handle imbalanced class data in the medical world, especially in the stroke patient cases discussed in this research.
One technique that can be used to handle imbalanced class data is the Synthetic Minority Over-sampling Technique (SMOTE). This is an oversampling technique that synthetically adds new samples to the minority class so that the number of samples in both classes becomes balanced. In this sense, SMOTE can increase classification accuracy and improve the results of patient diagnosis and treatment [7].
Based on the above background, the writer will conduct research entitled "Comparison of Data Mining Classification Algorithms for Stroke Disease Prediction Using the SMOTE Upsampling Method". The problem statement in this study is as follows: 1) What is the comparison of accuracy between the C4.5, NB, and KNN algorithms with and without the SMOTE Upsampling technique? 2) What is the accuracy of the C4.5, NB, and KNN algorithms using the SMOTE Upsampling technique in predicting stroke? The objectives of this study are as follows: 1) To compare the classification performance on datasets with and without the SMOTE Upsampling technique. 2) To determine the best-performing of the three Data Mining classification algorithms (C4.5, NB, and KNN) in predicting the likelihood of stroke using the SMOTE Upsampling technique.

A. Data Mining
Data mining is a process of identifying and extracting relevant information from large datasets using statistical, mathematical, artificial intelligence, and machine learning methods. It helps in discovering new and substantial information from databases and assists in decision-making for the future by finding important patterns in large databases. According to [8], data mining is the most crucial stage because it can reveal hidden patterns in data. Data mining has several uses, such as description, estimation, prediction, classification, clustering, and association. It can be included in a problem-solving strategy called CRISP-DM (the Cross-Industry Standard Process for Data Mining), as explained by [9]. CRISP-DM has a life cycle consisting of six phases, where each phase depends on the results of the previous phase. Fig. 1 shows the CRISP-DM cycle, with adaptive arrows indicating the relationship between each stage. As shown in Fig. 1, the CRISP-DM cycle consists of six phases [9], namely: business/research understanding phase, data understanding phase, data preparation phase, modeling phase, evaluation phase, and deployment phase.

B. Naïve Bayes (NB)
Naive Bayes is one of the popular classification methods used in machine learning. This algorithm is based on the Bayes theorem developed by the English mathematician Thomas Bayes. It is used to determine the probability of a class or category based on the given data. The advantages of this method are its low complexity and the small amount of data it requires for training [10]. The Bayes theorem has the general form (1).
$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)} \qquad (1)$$
Explanation:
y : data with an unknown class
x : hypothesis that data y belongs to a specific class
P(x|y) : probability of hypothesis x based on condition y (posterior probability)
P(x) : probability of hypothesis x (prior probability)
P(y|x) : probability of y based on hypothesis x
P(y) : probability of y
Naive Bayes is a simplification of Bayes' Theorem. The following is the simplified formula of Naive Bayes as (2).
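For reference, the naive independence assumption is commonly written as below. This is the standard textbook form rather than a formula taken from this study; splitting y into individual attribute values y_1, ..., y_n is notation assumed here for illustration.

```latex
P(x \mid y_1, \dots, y_n) \;\propto\; P(x) \prod_{i=1}^{n} P(y_i \mid x)
```

The predicted class is the hypothesis x that maximizes this product. P(y) is omitted because it is the same for every candidate class.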

C. Decision Tree (C4.5)
One of the most common methods for representing classification is C4.5. C4.5 is an algorithm used in machine learning to build decision tree models based on available data. It is a variant of the Decision Tree algorithm developed by computer expert J. R. Quinlan. C4.5 has several advantages over other algorithms, including the ability to handle data with continuous (numeric) and discrete (categorical) attributes, handle data with missing attribute values, create models that are more accurate and consistent than other Decision Tree algorithms, and handle heterogeneous and non-scaled data. According to [11], many branches of science have conducted in-depth research on the problem of decision tree construction from available data, including statistics, machine learning, pattern recognition, and data mining. There are several ways to build a decision tree, one of which is to use the C4.5 algorithm [9]. The steps are as follows:
1. First, collect training data. Training data usually comes from historical data, also known as previous data, which has been categorized into specific classifications.
2. Calculate the root of the tree. The root is chosen from the candidate attributes: the gain value of every attribute is calculated, and the attribute with the highest gain becomes the initial root. First, calculate the entropy value, then the gain value of the attribute. Formula (3) is used to calculate the entropy value:
$$Entropy(S) = \sum_{i=1}^{n} -p_i \log_2 p_i \qquad (3)$$
Explanation:
S : set of cases
n : number of partitions in S
p_i : proportion of S_i with respect to S
3. Then calculate the gain value using formula (4).
$$Gain(S, A) = Entropy(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} \times Entropy(S_i) \qquad (4)$$
Explanation:
S : set of cases
A : attribute
n : number of partitions in attribute A
|S_i| : number of cases in partition i
|S| : number of cases in S
4. Repeat steps 2 and 3 until all records have been partitioned.
5. The decision tree partitioning process stops when:
a. All records in node N have the same class.
b. There are no more attributes left to partition the records.
c. There are no records in the branch (the branch is empty).
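The entropy and gain calculations in steps 2 and 3 can be sketched as follows. This is a minimal illustration that follows the gain formula as stated in the text (full C4.5 additionally normalizes gain by split information, which is not shown). The toy records, attribute names, and function names are hypothetical and are not taken from the Kaggle dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S): sum over classes of -p_i * log2(p_i)."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def gain(rows, labels, attribute):
    """Gain(S, A): Entropy(S) minus the size-weighted entropy of each partition S_i."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy records, for illustration only.
rows = [
    {"hypertension": 1, "smoker": "yes"},
    {"hypertension": 1, "smoker": "no"},
    {"hypertension": 0, "smoker": "yes"},
    {"hypertension": 0, "smoker": "no"},
]
labels = ["stroke", "stroke", "no_stroke", "no_stroke"]

gains = {a: gain(rows, labels, a) for a in ("hypertension", "smoker")}
root = max(gains, key=gains.get)  # attribute with the highest gain becomes the root
print(gains, root)
```

In this toy set, hypertension separates the two classes perfectly, so it receives the full gain and is selected as the root, while the smoker attribute carries no information.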

D. K-Nearest Neighbor (KNN)
According to [12], the KNN algorithm is a classification method that considers the attributes of training data to make predictions on new data. KNN stores all training data and compares the attributes of new data with the records in the training data to determine the class of the new data. This is an example of instance-based learning and includes case-based reasoning that handles symbolic data. The algorithm is also an example of lazy learning, which waits until a query is made before processing the training data. The formula for calculating the closeness (similarity) between two cases is (5):
$$Similarity(T, S) = \frac{\sum_{i=1}^{n} f(T_i, S_i) \times w_i}{\sum_{i=1}^{n} w_i} \qquad (5)$$
Explanation:
T : new case
S : case in storage
n : number of attributes in each case
i : individual attribute between 1 and n
f : attribute similarity function between case T and S
w : weight given to the i-th attribute
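A weighted similarity of this kind, followed by a majority vote among the k most similar stored cases, might look like the sketch below. It assumes the weighted sum is normalized by the total weight and uses a simple exact-match function as f; both choices, along with the attribute names, weights, and records, are illustrative assumptions rather than the settings used in this study.

```python
from collections import Counter


def exact_match(t_value, s_value):
    """Hypothetical similarity function f: 1 if the attribute values match, else 0."""
    return 1.0 if t_value == s_value else 0.0


def similarity(new_case, stored_case, weights, f=exact_match):
    """Weighted similarity between new case T and stored case S, normalized by total weight."""
    weighted = sum(f(new_case[a], stored_case[a]) * w for a, w in weights.items())
    return weighted / sum(weights.values())


def knn_predict(new_case, stored_cases, weights, k=3):
    """Return the majority class among the k most similar stored cases."""
    ranked = sorted(
        stored_cases,
        key=lambda c: similarity(new_case, c["attrs"], weights),
        reverse=True,
    )
    votes = Counter(c["label"] for c in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical weights and stored cases, for illustration only.
weights = {"hypertension": 2.0, "smoker": 1.0}
stored = [
    {"attrs": {"hypertension": 1, "smoker": "yes"}, "label": "stroke"},
    {"attrs": {"hypertension": 1, "smoker": "no"}, "label": "stroke"},
    {"attrs": {"hypertension": 0, "smoker": "yes"}, "label": "no_stroke"},
    {"attrs": {"hypertension": 0, "smoker": "no"}, "label": "no_stroke"},
]
new_case = {"hypertension": 1, "smoker": "yes"}
prediction = knn_predict(new_case, stored, weights, k=3)
print(prediction)
```

Because no model is built in advance, all of the work happens inside `knn_predict` at query time, which is what makes the method a lazy learner.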

E. SMOTE Upsampling
A common issue in medical data is the uneven distribution of data between classes [6]. Misclassification can occur if there is an imbalance between the major and minor classes. When there is an imbalance, the classifier tends to default to the major class, which can result in misdiagnosis and mistreatment of patients. Therefore, understanding the issue of imbalanced data is crucial in the medical field.
To address the imbalance in the number of objects in two data classes, the Synthetic Minority Oversampling Technique (SMOTE) can be applied. The major class is the data class with the most objects, while the minor class is the other class. Models built using imbalanced data can have a significant negative impact on processing outcomes. The imbalance is often overlooked by processing algorithms, which can cause the major class to dominate the minor class.
According to [13], the SMOTE approach is an alternative to the oversampling strategies previously used to address imbalanced data problems. SMOTE differs from traditional oversampling, which duplicates existing records, in that it generates synthetic data by linking each minor-class record with its nearest neighbors from the same minor class. In this way, the number of records in the minor class can be increased to match the number in the major class, achieving data balance. The SMOTE approach is very useful in cases where imbalanced data causes poor model performance.
The KNN method is used to create the synthetic (fabricated) data. The number of nearest neighbors (k) is chosen for ease of implementation. Synthetic data is generated differently for numerical and categorical attributes. Euclidean distance is used as the benchmark when working with numerical data, while the mode is a simpler measure when working with categorical data. The Value Difference Metric (VDM) formula (6) is used to determine the distance between minor-class samples where the variable is on a categorical scale [13].
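For numerical attributes, the interpolation step described above can be sketched as follows. This is a simplified illustration, not the RapidMiner operator used in this study: it picks base samples at random rather than iterating over every minor-class record, handles numeric features only, and uses hypothetical (age, avg_glucose_level) values.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from numeric minor-class samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minor-class neighbors of the base sample (Euclidean distance)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # random position on the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minor-class records: (age, avg_glucose_level).
minority = [(67.0, 228.7), (80.0, 105.9), (49.0, 171.2), (79.0, 174.1)]
new_points = smote(minority, n_new=6, k=2)
print(new_points)
```

Each synthetic point lies on the line segment between two existing minor-class records, so the new data stays inside the region already occupied by that class instead of simply repeating records.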

G. Research Steps
This section explains the method used by the author in completing this thesis report. The research methodology guides the research from start to finish, so that the thesis report can be organized neatly and systematically. In summary, the research methodology used by the author is described in Fig. 2.

H. Data Understanding
The data understanding phase aims to analyze the collected data. In this study, the data was obtained from the Kaggle website, as explained in subsection F. The dataset used is titled "Stroke Prediction Dataset" and consists of 5110 data instances with 12 attributes (Table I).
The 11 attributes mentioned in Table I are supported by several medical journals, which explain that these attributes can be factors in the occurrence of stroke. One such journal article, written by [18] and published on the National Center for Biotechnology Information (NCBI) website, explains that age, gender, hypertension, history of heart disease, high blood glucose levels, and body mass index (BMI) are important risk factors in the occurrence of stroke.
According to a medical journal written by [19], unmarried, divorced, and widowed individuals have lower death rates within 1 week and 1 month after a stroke compared to married individuals. The study was conducted on 60,507 stroke patients in Denmark during the period 2003-2012. The "mortality displacement" factor associated with shorter life expectancy in unmarried, divorced, and widowed individuals may explain the research findings. The study explains that the attribute "ever_married" can have an influence on stroke incidence.
A medical journal written by [20], published on the American Heart Association (AHA) website, explains that based on available data, the rate of stroke and stroke-related deaths is higher in rural populations compared to urban populations. Vascular risk factors such as hypertension, diabetes mellitus, smoking, and atrial fibrillation are more frequently found in rural populations and are less controlled. Additionally, other factors such as obesity, sedentary lifestyle, alcohol consumption, dietary patterns, and social deprivation also influence the stroke rate in rural areas. Therefore, it can be concluded that "resident_type" (type of residence) can be a factor in stroke occurrence, and better management of vascular risk factors is needed in rural areas.
A medical journal written by [21], published in the Neurology Journal, explains that occupation can contribute to an increased risk of stroke. Jobs with high levels of stress (high strain jobs) are associated with an increased risk of stroke, especially ischemic stroke. The results are more significant in women than in men. However, active or passive job characteristics are not associated with an increased risk of stroke compared to jobs with low stress levels.
Furthermore, before entering the data preparation stage, the author conducted an interview with a doctor from a hospital in Bandung City. The interview was conducted to validate the attributes taken from the Kaggle website against the attributes commonly used in the field. As a result, the doctor provided 7 recommended attributes that are commonly used in the field and are also found in the Kaggle dataset: gender, age, hypertension, heart_disease, avg_glucose_level, bmi, and smoking_status.
These seven recommended attributes are based on the latest data from the CDC, which serves as a guideline. The CDC [22] is a data-driven and science-based service organization in the United States that protects public health. It has been in operation for over 70 years and puts science into action to help children stay healthy so they can grow and learn, to assist families, businesses, and communities in fighting diseases and staying strong, and to protect public health.
The latest CDC data [23] includes 13 attributes that can be risk factors for stroke, including 1) previous stroke or transient ischemic attack (TIA), 2) high blood pressure, 3) high cholesterol, 4) heart disease, 5) diabetes, 6) obesity, 7) sickle cell disease, 8) genetics and family history, 9) age, 10) sex, 11) race or ethnicity, 12) not getting enough physical activity, and 13) lifestyle (tobacco use, not getting enough physical activity, alcohol, eating a diet high in saturated fats, trans fat, and cholesterol). This data is also in line with the stroke implementation guidelines written by [24], which describe 11 attributes that can be potential risk factors for stroke, including high blood pressure, diabetes, coronary heart disease, alcohol consumption, high cholesterol, smoking habits, obesity, blood clotting disorders, stress, lack of physical activity, and unchangeable risk factors such as advanced age (>60 years) and genetics. These sources show that the seven recommended attributes are included in the 13 and 11 attributes previously described as potential risk factors for stroke. The results of the interview with the doctor are backed by the recommended journals, provided as data references to strengthen the recommended attributes, namely: gender [25], age [23], hypertension [26], heart_disease [27], avg_glucose_level [28], bmi [29], and smoking_status [30].

A. Data Preprocessing
Data preprocessing prepares the data for the next steps. It involves selecting patient data that has complete information regarding smoking status. The ID attribute is not used because it is not relevant as a determinant of stroke disease. The purpose of this step is to generate a dataset consisting of patient data that has the necessary information for stroke prediction analysis. After preprocessing, the stroke patient data is reduced to 4024 records, comprising 3843 records for suffered_stroke and 181 records for no_stroke. The data to be used consists of 11 attributes, with 10 predictor attributes and 1 target attribute. This ensures that the data used in the next stage is of good quality and ready to be used for building an accurate stroke disease prediction model.
After data preprocessing, the dataset is divided into two for testing in RapidMiner. The first dataset is modeled without SMOTE, while the second dataset applies the SMOTE method to generate fabricated data that balances suffered_stroke and no_stroke. The dataset produced by the SMOTE Upsampling method amounts to 7686 records, with 3843 records for suffered_stroke and 3843 records for no_stroke (Table II).
In the development of models in the medical field, accuracy, precision, recall, F1 score, and AUC value are important metrics for evaluating model performance. However, high accuracy does not always indicate that the model performs well in predicting results on medical data. Therefore, it is important to consider other metrics such as precision, recall, F1 score, and AUC value to evaluate the overall performance of the model.
The differences between the NB, C4.5, and KNN algorithms when tested with and without the SMOTE technique indicate that this technique can improve the model's performance in handling imbalanced medical data. The test results show that using the SMOTE technique can yield significant differences in several evaluations, such as class precision, recall, F1 score, and AUC values for each algorithm. The NB algorithm has a difference of 53.48% for class precision, 63.13% for recall, 58.28% for F1 score, and 0.02 for AUC value when using the SMOTE technique compared to testing without it. The C4.5 algorithm has a difference of 80.93% for class precision, 79.34% for recall, 80.11% for F1 score, and 0.37 for AUC value when using the SMOTE technique. The KNN algorithm also has a relatively large difference: 83.53% for class precision, 59.14% for recall, 70.31% for F1 score, and 0.28 for AUC value when using the SMOTE technique.
In the development of models in the medical field, the SMOTE technique can be used to handle imbalanced data and improve the performance of the model in predicting results. The journal "Stroke Risk Prediction with Machine Learning Techniques" written by [32] and published in the National Library of Medicine explains that class balance is very important in designing effective methods for predicting stroke, one of which is the Synthetic Minority Over-Sampling Technique (SMOTE). The journal also explains that when dealing with imbalanced data, metrics such as precision and recall are more suitable for identifying model errors. Precision measures how many of the patients predicted to have had a stroke actually belong to this class, while recall measures how many of the patients who had a stroke are correctly predicted.
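The point that accuracy alone can mislead on imbalanced data can be checked with a small calculation. The sketch below computes the four metrics from confusion-matrix counts and applies them to a hypothetical baseline that always predicts the larger class, using the class sizes reported in the text (3843 and 181) and treating the smaller class as the positive class. The function name and the baseline are illustrative assumptions, not part of the study's RapidMiner workflow.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of true positives that were predicted positive.
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical baseline: every record is assigned to the larger class (3843 records),
# so all 181 records of the smaller (positive) class are missed.
baseline = classification_metrics(tp=0, fp=0, fn=181, tn=3843)
print(baseline)
```

This baseline reaches an accuracy of about 0.955 while its recall and F1 score are both zero, which illustrates why precision, recall, and F1 are reported alongside accuracy.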
The two metrics, precision and recall, can also affect the AUC value on the ROC curve. The closer the AUC value is to one, the better the performance of the machine learning model in distinguishing between patients who have had a stroke and those who have not. Therefore, in addition to accuracy, it is important to pay attention to other metrics such as precision, recall, and AUC value, and to use techniques like SMOTE to improve model performance, so that prediction results can be relied upon and are useful in medical decision-making in cases of data imbalance.
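One way to read the AUC value is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes AUC through this pairwise interpretation. The scores and labels are hypothetical, and this direct pairwise count is only one of several equivalent ways to obtain the value, not necessarily the one used by RapidMiner.

```python
def auc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted stroke scores and true labels (1 = stroke, 0 = no stroke).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auc(scores, labels))
```

A value of 1.0 means every positive case is ranked above every negative case, while 0.5 corresponds to ranking no better than chance.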
Based on the comparison of the data mining algorithm tests for predicting stroke disease in Table III, the Decision Tree C4.5 algorithm showed the best accuracy, recall, precision, F1 score, and AUC value compared to Naïve Bayes and K-Nearest Neighbor. Therefore, testing on new data is needed to further validate the stroke prediction. The new data used in this test is taken from one of the hospitals in Bandung city, different from the hospital where the authors conducted interviews with medical practitioners. As the best of the three selected algorithms, C4.5 is chosen to obtain accurate and reliable prediction results. Fig. 3 shows the new stroke patient data to be tested for prediction, in which all patients belong to the suffered_stroke class.
The results of testing the C4.5 algorithm with and without the SMOTE method on the new data from one of the hospitals in Bandung are documented in Table IV. The results indicate that C4.5 with the SMOTE method predicts stroke disease more accurately than C4.5 without SMOTE.
Furthermore, the author validated the results by conducting follow-up interviews with the doctors who provided recommendations regarding the attributes. The findings can be summarized as follows: although the developed model predicts stroke accurately, it cannot be used directly in medical practice because the clinical observations needed to determine stroke-related factors are highly complex. This revealed that predicting stroke in a practical setting is highly complex. While data mining can be used as an initial-stage predictive tool for the general population, it is strongly recommended to seek direct examination by doctors in a hospital to obtain more accurate medical evaluations.

IV. CONCLUSION
This study compared the performance of three data mining classification algorithms (Naïve Bayes, Decision Tree C4.5, and K-Nearest Neighbor) in predicting stroke disease. Two different datasets were used to test the algorithms' performance, one with the SMOTE technique applied and one without it. The results showed that the SMOTE technique improved the precision, recall, F1 score, and AUC for all three algorithms, although the accuracy was slightly lower compared to the data without SMOTE. The C4.5 algorithm with the SMOTE technique demonstrated the best performance. Therefore, C4.5 was chosen to predict the data of new stroke patients obtained from one of the hospitals in Bandung. C4.5 with the SMOTE technique predicted with higher accuracy than C4.5 without SMOTE. However, based on the author's interview with a medical practitioner, the model cannot be used directly in medical practice because the clinical observations needed to determine stroke-related factors are highly complex, which makes stroke prediction in a practical setting difficult. While data mining can be used as an initial-stage predictive tool for the general population, it is strongly recommended to undergo direct examination by doctors in a hospital to obtain more accurate and comprehensive medical evaluations.

Fig. 3 Testing new stroke patient data

a. Choose the majority value among the main vector and its k-nearest neighbors for the nominal value. If there is a tie, choose randomly.
b. Make the chosen value the new artificial class example.

TABLE III
COMPARISON OF NB, C4.5, AND KNN ALGORITHM TESTING RESULTS USING SMOTE AND WITHOUT SMOTE

TABLE IV
TESTING NEW STROKE PATIENT DATA USING C4.5 ALGORITHM WITH SMOTE METHOD