Resampling Technique for Imbalanced



I. INTRODUCTION
Educational data mining is an emerging field within data mining. Accurately identifying student accomplishment in a current or upcoming course can help an institution build better technology-aided education [1]. Educational data mining is becoming a more important field of study because of its potential to produce knowledge-based models that can assist teachers and lecturers [2]. Several studies have been conducted in this field, such as [3], which predicts the drop-out potential of a student using the Random Forest classification method, and [2], which measures the potential success of engineering students based on their performance in the first three years of study. Other studies that focus on educational data mining include [1], [4]-[7].
As in other classification tasks, educational data mining faces a common and frequently encountered problem: class imbalance. An imbalanced class is a condition in which the classes are not represented in equal proportions [8]. Imbalanced data can degrade classification performance because it biases the result toward the majority class. To make the classification fair to every class, the imbalance needs to be handled first. There are two mechanisms for handling imbalanced data, the data-level approach and the algorithm-level approach [9]. The data-level approach alters the volume of the dataset, either trimming or adding records with a resampling technique; whether records are removed or added depends on the resampling algorithm. The data-level approach has two kinds of resampling mechanisms, oversampling and undersampling [10]. Undersampling reduces the amount of data in the majority class until it is balanced with the minority class [11]. Oversampling, in contrast, adds synthetic data to the minority class until it is balanced with the majority class [12]. While undersampling may discard information because it removes data, oversampling preserves the existing information. The synthesized minority-class data is generated so as to preserve the characteristics of the original data in the classes that have fewer records than the others.
A student performance dataset is likely to be imbalanced because of the distribution of students' grades in a course. To preserve the general pattern and to make sure the classification result is appropriate, the dataset undergoes a resampling step if it shows a class imbalance problem. In this research, the class distribution was found to be severely imbalanced, and the dataset is multiclass, with more than two class labels. Given this problem, this paper focuses on imbalanced class handling and classification, using Logistic Regression, Random Forest, and Stacking for classification and SMOTE, ADASYN, and SMOTE-ENN for resampling. The expected outcome of this research is an understanding of how these methods interact and perform on this dataset, and of whether resampling improves classification performance in comparison with classification without resampling.

II. METHOD
This research was conducted using a private dataset of students' performance in several courses at the university, specifically in the Information System program. Each student's activity is used to determine a performance grade, which serves as the class of the dataset. Grading was done manually based on the student's final exam score and activeness. Each student is graded into one of three categories, High, Middle, and Low, representing how well they perform in the class. The class distribution after labeling strongly indicates class imbalance in this dataset, so imbalanced class handling is applied to provide a better and fairer classification of the data. This study follows the schematic diagram shown in Fig. 1.

A. Dataset
This research uses students' course results from three different courses at our university. The dataset has been labeled as "Low", "Middle", and "High" according to the students' performance in each course, which reflects their class activities and exam scores. The dataset contains 298 records with 7 predictor variables, including Course, Attendance(%), MidTerm, 1stAssessment, 2ndAssessment, and Class_Activity, and one class attribute. A sample of the dataset can be seen in Table I.
Table I shows that the dataset is a combination of numerical and nominal features, so it needs to be encoded before being processed by any data mining method. The encoding is carried out in the preprocessing step by converting every nominal attribute into unique integer values, such as "Yes - No" into 1 - 0. Before any data mining step, exploratory data analysis is carried out on the dataset to better understand its characteristics and to find anything important that needs to be addressed. First, the distribution of the classes is visualized to show how many records belong to each class. The class distribution can be seen in Fig. 2.
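The integer encoding described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the preprocessing code used in the study; the column values and the helper name `fit_integer_encoding` are hypothetical placeholders, since the paper does not list the actual attribute values.

```python
def fit_integer_encoding(values):
    """Map each distinct nominal value to a unique integer, in first-seen order."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


# Hypothetical Yes/No column; an explicit mapping reproduces the
# "Yes - No" -> 1 - 0 convention mentioned in the text.
class_activity = ["Yes", "No", "Yes", "Yes", "No"]
yes_no = {"Yes": 1, "No": 0}
encoded_activity = [yes_no[v] for v in class_activity]

# Hypothetical course names; each distinct course receives its own integer.
courses = ["A", "B", "A", "C"]
course_codes = fit_integer_encoding(courses)
encoded_courses = [course_codes[c] for c in courses]

print(encoded_activity)
print(encoded_courses)
```

Integer codes impose an artificial order on nominal values, which distance-based resamplers such as SMOTE will treat as meaningful, so the choice of encoding is worth keeping in mind when interpreting the resampled data.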

B. Preprocessing
Fig. 2 shows that the class distribution is skewed, that is, imbalanced among the classes. Under imbalance, any classification method is likely to be biased toward the majority class because that class has more data than the others. The imbalance needs to be addressed before the dataset is processed further in the classification step because the sizes of the majority and minority classes differ significantly. The record counts for each class are High: 170, Low: 58, and Middle: 41. Even though an imbalanced condition can sometimes be harmless to classification performance, it is important to make sure that the classification gives its optimum result. Therefore, the imbalance is handled first using an appropriate mechanism, namely resampling.
Resampling is a mechanism that "balances" the dataset by either reducing the majority samples or synthesizing minority samples so that the distribution of the classes becomes roughly the same. These two mechanisms are called undersampling and oversampling, respectively. Each has its own advantages and disadvantages. According to studies on imbalanced data, however, oversampling, alone or as part of a hybrid method (oversampling combined with undersampling), is the more widely used mechanism.
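To make the degree of skew concrete, the short sketch below computes each class share and the majority-to-minority ratio from the counts quoted above. The variable names are illustrative only.

```python
# Class counts as quoted in the text.
class_counts = {"High": 170, "Low": 58, "Middle": 41}

total = sum(class_counts.values())  # the quoted counts sum to 269
shares = {label: count / total for label, count in class_counts.items()}

majority = max(class_counts, key=class_counts.get)
minority = min(class_counts, key=class_counts.get)
imbalance_ratio = class_counts[majority] / class_counts[minority]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"{majority}:{minority} ratio = {imbalance_ratio:.2f}")
```

With these counts the High class holds roughly 63% of the records and is about four times the size of the Middle class.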
In this study, three resampling techniques are implemented, namely SMOTE [13], ADASYN [14], and SMOTE-ENN [13]. The first two are oversampling methods and the last is a hybrid method. Using different resampling methods allows a comparison of which one performs better with each classification method, which in turn supports the justification of why a particular method is preferable under certain dataset conditions. SMOTE works by synthesizing minority samples based on the neighboring samples of the minority class(es). New data points are generated using (1).

$$x_{new} = x_i + (\hat{x}_i - x_i) \times \delta \qquad (1)$$

where $x_i$ is a minority sample, $\hat{x}_i$ is a random sample from its k nearest neighbors, and $\delta$ is a random number in the interval [0,1].
ADASYN is also an oversampling method; it is a further development of SMOTE that works more adaptively to make the dataset relatively balanced. The steps of the ADASYN method are given in [14].
A combination of SMOTE with an undersampling method gives better performance [15]. One of the popular hybrid oversampling-undersampling methods is SMOTE-ENN, developed by [13]. This method combines SMOTE's ability to generate synthetic samples of the minority classes with the ability of ENN to delete records, from all classes, whose class differs from the majority class of their k nearest neighbors. The process of SMOTE-ENN is as follows:
1. (Start of SMOTE) Choose random data from the minority class.
2. Calculate the distance between the random data and its k nearest neighbors.
3. Multiply the difference by a random number between 0 and 1, then add the result to the minority class as a synthetic sample.
4. Repeat steps 2-3 until the desired proportion of the minority class is met. (End of SMOTE)
5. (Start of ENN) Determine K, the number of nearest neighbors. If not determined, then K=3.
6. Find the k nearest neighbors of the record among the other records in the dataset, then return the majority class of those neighbors.
7. If the class of the record and the majority class of its k nearest neighbors differ, then the record and its k nearest neighbors are deleted from the dataset.
8. Repeat steps 6 and 7 until the desired proportion of the dataset is met. (End of ENN)
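The SMOTE-ENN procedure above can be sketched in a few lines of standard-library Python. This is a didactic sketch under stated assumptions, not the implementation used in the study: the function names (`nearest`, `smote`, `enn`) and the toy 2-D points are hypothetical, and the ENN step is simplified to delete only the disagreeing record itself rather than the record together with its neighbors.

```python
import math
import random
from collections import Counter


def nearest(points, index, k):
    """Indices of the k points closest to points[index], excluding itself."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: math.dist(points[index], points[j]))
    return others[:k]


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating toward a random neighbor."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))          # random minority sample
        j = rng.choice(nearest(minority, i, k))   # one of its k nearest neighbors
        delta = rng.random()                      # random number in [0, 1)
        x, x_hat = minority[i], minority[j]
        synthetic.append(tuple(a + (b - a) * delta for a, b in zip(x, x_hat)))
    return synthetic


def enn(points, labels, k=3):
    """Simplified ENN: keep a record only if its label matches the majority
    label of its k nearest neighbors (ties resolve to the first label seen)."""
    keep = []
    for i in range(len(points)):
        votes = Counter(labels[j] for j in nearest(points, i, k))
        if votes.most_common(1)[0][0] == labels[i]:
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]


# Toy data: a minority cluster on the unit square, a majority cluster far away,
# and one noisy majority record sitting inside the minority cluster.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0), (5.5, 5.5), (0.5, 0.5)]

synthetic = smote(minority, n_new=2, k=2)
points = minority + synthetic + majority
labels = ["min"] * (len(minority) + len(synthetic)) + ["maj"] * len(majority)

clean_points, clean_labels = enn(points, labels, k=3)
print(len(points), "->", len(clean_points))
```

In this toy setup the synthetic points always fall inside the unit square, because each one is a convex combination of two minority points, and the ENN pass removes the noisy majority record at (0.5, 0.5) whose neighbors are all minority samples.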

C. Classification
Several classification methods are used in this research to show how each of them performs on the dataset under different resampling mechanisms. The classification methods used are Logistic Regression, Random Forest, and the ensemble technique Stacking.
Logistic Regression is a classification model for a categorical dependent variable, which can take values such as 0 and 1, or true and false. The logistic regression equation is stated in (2) and (3).

$$\ln\left(\frac{p}{1-p}\right) = B_0 + B_1 X \qquad (2)$$

where:
$B_0$ = constant
$B_1$ = the coefficient of each variable $X$

The value of p, the probability that Y = 1, can be found by the equation:

$$p = \frac{e^{B_0 + B_1 X}}{1 + e^{B_0 + B_1 X}} \qquad (3)$$

This equation can be used to calculate the probability for a respondent whose variable values have been defined in the equation; the resulting p always lies in the range 0 to 1 [1], [16], [17].
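As a worked illustration of the relationship between the log-odds and the probability in logistic regression, the sketch below evaluates the probability for a single predictor. The coefficient values are hypothetical and chosen only to make the arithmetic easy to follow; they are not fitted to the study's data.

```python
import math


def log_odds(b0, b1, x):
    """Linear predictor: ln(p / (1 - p)) = B0 + B1 * X."""
    return b0 + b1 * x


def probability(b0, b1, x):
    """Probability that Y = 1; 1 / (1 + e^-z) is algebraically equal to
    e^z / (1 + e^z), with z = B0 + B1 * X."""
    z = log_odds(b0, b1, x)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for a single predictor (e.g. a score out of 100).
b0, b1 = -4.0, 0.05

p_at_80 = probability(b0, b1, 80.0)    # log-odds 0  -> p = 0.5
p_at_100 = probability(b0, b1, 100.0)  # log-odds 1  -> p is about 0.73

print(round(p_at_80, 3), round(p_at_100, 3))
```

Taking the logarithm of p / (1 - p) for either result recovers the linear predictor, which is the sense in which the two equations describe the same model.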
Random Forest is a classification method that consists of a large number of individual decision trees operating as an ensemble (Fig. 3). Each tree in the random forest produces a class prediction, and the class with the most votes becomes the model's prediction [1], [18], [19].
The stacking technique allows several classification methods to work together and produce a stronger classification model. The base learners are trained in parallel, and their outputs are then combined by training a meta-learner to produce a prediction based on the base learners' predictions. The meta-learner takes those predictions as its features and the ground truth values as its target. It attempts to learn how best to combine the input predictions to make better output predictions [20]-[23]. An illustration of how the stacking technique works can be seen in Fig. 4.
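The two combination schemes described here can be illustrated with a small standard-library sketch. It shows only the majority vote of a forest and the layout of the inputs a stacking meta-learner would receive; it does not train any model, and the prediction lists and dictionary keys are made-up placeholders.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions of five trees for one student.
tree_votes = ["High", "Low", "High", "Middle", "High"]
forest_prediction = majority_vote(tree_votes)

# For stacking, each base learner's predictions become one feature column
# for the meta-learner; each row below describes one student.
base_predictions = {
    "logistic_regression": ["High", "Low", "High"],
    "random_forest": ["High", "Low", "Middle"],
}
meta_features = list(zip(*base_predictions.values()))
ground_truth = ["High", "Low", "Middle"]  # target the meta-learner is trained on

print(forest_prediction)
print(meta_features)
```

The third row shows where a meta-learner can add value: the base learners disagree, and the meta-learner can learn from the ground truth which of them to trust in such cases.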

III. RESULT AND DISCUSSION
Experiments were carried out on the dataset described above and are explained in this section. The experimental scenario is described in the following subsection.

A. Experimental setup
The validation schemes used in this study are 10-fold cross-validation and an 80-20 random split. Each combination of classification and resampling method is evaluated under both schemes. The experiment scheme is shown in Table II. Overall performance is expressed by the F1 score. Accuracy is not used as a performance measure because it is biased under class imbalance.
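The reason accuracy is misleading under class imbalance can be shown with a short sketch. It scores a trivial classifier that always predicts the majority class, using the class counts quoted in Section II.B. The paper does not state which F1 averaging it uses, so the macro average below is an assumption made for illustration.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score of one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct positive prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    labels = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, label) for label in labels) / len(labels)


# Class counts from the text; the classifier always answers "High".
y_true = ["High"] * 170 + ["Low"] * 58 + ["Middle"] * 41
y_pred = ["High"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)

print(f"accuracy = {accuracy:.3f}, macro F1 = {score:.3f}")
```

The always-High classifier reaches about 0.63 accuracy while never identifying a single Low or Middle student, whereas its macro F1 of about 0.26 exposes the failure.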

B. Experiment Result
In the preprocessing step, each nominal attribute is transformed into a numerical one using unique integer encoding. All data is made numeric to enable the calculations in the resampling and classification steps. One of the results of the resampling step can be seen in Fig. 5. Table III provides the classification results of each resampling method along with the evaluation metric used in this study.
As shown in Table III, each method combination produces a different result. For comparison, classification without resampling is also performed to show the effect of the resampling methods on the dataset. The most remarkable result was obtained by the stacking algorithm with the ADASYN resampling method under the 80:20 split, with a score of 0.97. Under 10-fold cross-validation, the stacking algorithm is on par with Random Forest at 0.90. The gap between the resampled and no-resampling scores is not large. This can be caused by several things; for example, the general pattern of the data for each class may already be good from the start, so there is no real drawback in processing the original data. Another possible reason is that the methods used in this research are not yet optimal, and a better resampling method may exist for this kind of dataset. Nevertheless, the results show that resampling improves the performance of the classification methods to some extent.
To get a better understanding of the data, feature importance is calculated as part of the Random Forest classification process. The feature importance visualization can be seen in Fig. 6. According to Fig. 6, the 2nd Assessment (feature 4) has the highest importance score, which implies that this feature plays a large role in determining which class a student belongs to. With the importance levels known, further analysis is carried out on the relationship between this feature and one particular class (Low). The result of the analysis can be seen in Fig. 7.
Fig. 7 shows that the majority of the students in the "Low" class did not complete the 2nd Assessment. Together with the feature importance score, this implies that how a student performs in the class is most likely influenced by completion of the 2nd Assessment of the course. The second most important feature according to the feature importance scores is MidTerm. The mean MidTerm score in each class is shown in Fig. 8.

IV. CONCLUSION
Based on the results of this research, the stacking classification algorithm performs better than the other two classifiers (Logistic Regression and Random Forest), with an F1 score of 0.97 as the measure of how well classification works under class imbalance. The differences between the classification results may stem from the nature of the classifiers themselves.
Stacking is theoretically expected to be superior to the other two classifiers because it combines several single classifiers into one more powerful classifier. This means that the stacking algorithm compensates for the drawbacks of the individual classifiers and builds on their advantages. The results of this study also show that resampling improves classification performance. The no-resampling classification produced a decent result too, which can be caused by several things; for example, the general pattern of the data for each class may already be good from the start, so there is no real drawback in processing the original data. Another possible reason is that the methods used in this research are not yet optimal, and a better resampling method may exist for this kind of dataset. According to the data analysis carried out after the data mining process, the feature importance scores from the Random Forest classifier show that the 2nd Assessment is the most important feature in determining which class a student belongs to. Most of the students in the Low class did not complete the 2nd Assessment and had a low MidTerm score. MidTerm is the second most important feature according to the feature importance scores. So far, the prediction of how a student will perform in the class is based only on academic performance. It is possible to take students' demographics or daily life into consideration as well, which may give a better perspective on which kinds of students will succeed in class and which will fail. Possible future research is to add more features from other aspects of the students.

Fig. 5 Resampled data distribution using SMOTE
Fig. 5 shows the resampling result of SMOTE, which balances the class distribution so that every class has the same amount of data (170 records). The X-axis value represents the encoded classes: 0 for High, 1 for Middle, and 2 for Low.

Fig. 7 Student distribution of 2nd assessment on low class

TABLE I STUDENT'S DATA
No Course Attendance(%) MidTerm 1

Fig. 2 Class distribution
where $\Delta_i$ is the number of samples that belong to the majority class among the k nearest neighbors of $x_i$.
4. Normalize $r_i$ according to $\hat{r}_i = r_i / \sum_i r_i$, so that $\hat{r}_i$ is a probability distribution and $\sum_i \hat{r}_i = 1$.
5. Compute the number of samples to synthesize for each minority sample $x_i$ as $g_i = \hat{r}_i \times G$.
6. Choose a sample $x_{zi}$ from the k nearest neighbors of $x_i$ in the minority class. Synthesize a new sample $s_i$, where $s_i = x_i + (x_{zi} - x_i) \times \lambda$ and $\lambda \in [0,1]$ is a random number.
7. Repeat step 6 $g_i$ times to obtain $g_i$ synthetic samples for $x_i$.
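The adaptive allocation in the ADASYN steps above can be sketched as follows. This is a minimal standard-library illustration, not the implementation used in the study. It assumes that the ratio for each minority sample is the share of majority samples among its k nearest neighbors, that the total number of synthetic samples is passed in directly as the hypothetical parameter `total_new`, and that fractional allocations are simply rounded. The toy points and the function name `adasyn_allocation` are made up for the example.

```python
import math


def adasyn_allocation(minority, majority, k=3, total_new=6):
    """Number of synthetic samples to generate for each minority sample."""
    labelled = [(p, "min") for p in minority] + [(p, "maj") for p in majority]
    ratios = []
    for i, x in enumerate(minority):
        # All other records, sorted by distance to x.
        others = [item for j, item in enumerate(labelled) if j != i]
        others.sort(key=lambda item: math.dist(x, item[0]))
        # Majority samples among the k nearest neighbors of x.
        delta_i = sum(1 for _, label in others[:k] if label == "maj")
        ratios.append(delta_i / k)
    total = sum(ratios)
    if total == 0:
        return [0] * len(minority)  # no minority sample borders the majority
    weights = [r / total for r in ratios]           # normalized, sums to 1
    return [round(w * total_new) for w in weights]  # samples per minority point


# Toy data: three minority points with one majority neighbor each, and one
# minority point at (4, 4) that is surrounded by majority points.
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (4.0, 4.0)]
majority = [(4.0, 5.0), (5.0, 4.0), (5.0, 5.0), (1.5, 0.0)]

allocation = adasyn_allocation(minority, majority, k=3, total_new=6)
print(allocation)
```

The minority point surrounded by majority samples receives half of the synthetic budget, which is the adaptive behavior that distinguishes ADASYN from the uniform sampling of SMOTE.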