Voting Scheme for Nearest Neighbors Based on Different Distance Metrics

K-Nearest Neighbor (KNN) is a widely used method for both classification and regression tasks. The algorithm, known for its simplicity and effectiveness, relies primarily on the Euclidean formula as its distance metric. Therefore, this study aimed to develop a voting model in which observations were made using different distance calculation formulas. The nearest neighbors algorithm was divided based on differences in distance measurements, with each resulting model contributing a vote to determine the final class. Consequently, three methods were proposed, namely k-nearest neighbors (KNN), Local Mean-based KNN (LMKNN), and Distance-Weighted KNN (DWKNN), each with the inclusion of a voting scheme. The robustness of these models was tested using umbilical cord data characterized by class imbalance and small dataset size. The results showed that the proposed voting model for nearest neighbors consistently improved performance by an average of 1-2% across accuracy, precision, recall, and F1 score when compared to the conventional non-voting method.


I. INTRODUCTION
K-Nearest Neighbor (KNN) is a distance-based classification method whose operation includes identifying the nearest neighbors of the test data within the range of the training data. This proximity can be measured using a distance function, with Euclidean distance serving as the most prevalent choice. KNN has several advantages, including its simplicity, ease of explanation, and adaptability to irregular feature spaces. Over time, it has been subject to various modifications for performance improvement. However, the method has garnered attention in previous literature due to three problematic issues. Firstly, it is sensitive to the neighborhood size parameter k [1]-[3]. The performance can deteriorate when outliers are present, whether k is set to a smaller or larger value. Selecting a small k often results in suboptimal classification outcomes, particularly in discrete and noisy datasets. Conversely, selecting a large k can also compromise classification outcomes due to the influence of outliers. Secondly, KNN is sensitive to the distance function used for selecting the k nearest neighbors [4]-[6]. Thirdly, the method can be highly complex due to the nearest neighbor (NN) search [7]-[9]. This aspect poses a significant challenge because KNN must calculate the distances to all samples in order to identify the k nearest neighbors for each given query (test data).
The development of the nearest neighbors method has been widely carried out, with a focus on addressing the identified weaknesses. Several studies aimed at improving the method have been previously proposed, particularly in addressing the sensitivity issue. One approach incorporates a local mean factor to reduce the sensitivity to the k value. Various methods, such as k-harmonic nearest neighbors (KHNN) [10], local mean-based KNN (LMKNN) [11], local mean-based pseudo-NN (LMPNN) [12], and multi-local means-based NN (MLNN) [13], have been developed to reduce the impact of outliers around the sample points. Other methods, such as pseudo nearest neighbors (PNN) [14], weighted representation-based KNN (WRKNN), and weighted local mean representation-based KNN (WLMRKNN) [15], introduced weights for each neighborhood data point. These weighting methods are based on the premise that each nearest neighbor contributes differently to the classification outcome. The development of the dual distance-weight technique led to the introduction of the distance-weighted k-nearest neighbor rule (DWKNN) [16]. This method reduces the weight of each nearest neighbor, except the first (closest) and the k-th. Several alternative neighborhood methods have also been applied to classification problems in order to address practical issues in KNN. For instance, the surrounding neighborhood-nearest centroid neighbor (NCN) was derived for finite sample-size situations, with extensions such as KNCN [17] and LMKNCN [18] exhibiting satisfactory performance.

The second key concern is the selection of a suitable distance metric for evaluating the distance between query and training samples, a crucial factor in classification decisions. To enhance the classification performance of KNN, several local and global feature weighting-based distance metric methods have been developed [19], [20]. However, these approaches often overlook the correlations between all training samples, signifying the importance of accurately defining a distance metric for KNN classification. Based on fuzzy set theory, fuzzy nearest neighbor classifiers that introduce fuzziness into KNN were proposed in [21]-[23]. Some evidence-theoretic KNN classifiers have also been explored [24], [25] from the perspective of Dempster-Shafer theory. Derrac et al. recently conducted a comprehensive review of the most relevant algorithms for fuzzy nearest neighbor classification [26]. In contrast to algorithms that use all training samples, several prototype-based classifiers have emerged. These approaches, including the selection [27]-[29], generation [30], [31], and optimization [32]-[34] of prototypes, leverage a few well-represented prototypes to achieve optimal classification performance, improving speed, storage, and accuracy.

Based on the extensive literature on KNN development, the current study aimed to analyze the impact of different distance metric formulations, namely Euclidean, Minkowski, and Manhattan distances. The proposed model applies a voting scheme over these three distinct distance metrics to the original KNN method, LMKNN, and DWKNN. The effect of the model was evaluated using umbilical cord data. The main contribution of this research is the application of voting to the three methods above using different distance metrics, with each metric given equal weight in the vote.
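For reference, the three distance metrics compared in this study take their standard forms for two feature vectors $x, y \in \mathbb{R}^d$; the Minkowski order $p$ is not stated in the text, so it is left as a free parameter here:

```latex
d_{\mathrm{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}, \qquad
d_{\mathrm{Manhattan}}(x, y) = \sum_{i=1}^{d} |x_i - y_i|, \qquad
d_{\mathrm{Minkowski}}(x, y) = \Bigl( \sum_{i=1}^{d} |x_i - y_i|^{p} \Bigr)^{1/p}
```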

A. Voting Scheme for Nearest Neighbors
The proposed method in this study utilized voting calculations to determine the final class for the test data. The final prediction of the model was determined by the highest number of votes received. A total of 12 models were analyzed for their performance, based on differences in distance metric measurements, specifically Euclidean, Manhattan, and Minkowski. The classifier algorithms used were the KNN method and its state-of-the-art developments, namely LMKNN and DWKNN. Fig. 1 shows the proposed model scheme based on the voting method.
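A minimal sketch of this voting scheme is given below, assuming scikit-learn's KNeighborsClassifier as the base learner. The function names, the choice of p = 3 for the Minkowski metric, and the tie-breaking rule are illustrative assumptions rather than details taken from the original model.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

def fit_metric_models(X_train, y_train, k=3):
    """Train one KNN base model per distance metric (Euclidean, Manhattan, Minkowski)."""
    configs = [("euclidean", {}), ("manhattan", {}), ("minkowski", {"p": 3})]
    models = []
    for metric, extra in configs:
        clf = KNeighborsClassifier(n_neighbors=k, metric=metric, **extra)
        clf.fit(X_train, y_train)
        models.append(clf)
    return models

def vote_predict(models, X_test):
    """Each metric-specific model casts one equal-weight vote; the majority class wins."""
    votes = np.array([m.predict(X_test) for m in models])   # shape: (3, n_test)
    final = []
    for sample_votes in votes.T:                             # votes for one test sample
        # Counter.most_common breaks ties by insertion order (an assumption of this sketch).
        final.append(Counter(sample_votes).most_common(1)[0][0])
    return np.array(final)
```

The same wrapper can be reused with LMKNN or DWKNN as the base classifier, which yields the 12 model variants (3 classifiers × 3 single metrics + 3 voting combinations) analyzed in the study.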

B. Local Mean Based k-Nearest Neighbors (LMKNN)
Local Mean-based K-Nearest Neighbor (LMKNN) classification extends the K-Nearest Neighbor algorithm and was specifically designed to address the sensitivity of KNN to outliers, particularly when training sample sizes are small. The basic concept of LMKNN is the identification of the k nearest neighbors within each class for the query sample, followed by classification against the local mean of those neighbors. By calculating the local mean for each class, LMKNN aims to effectively capture the underlying data structure, particularly in cases of small training samples or in the presence of outliers. In summary, it is a classification method that leverages the local mean of nearest neighbors to classify query samples and effectively address outlier issues in KNN, as formalized in (1).
Let $TS = \{x_i \in \mathbb{R}^d\}_{i=1}^{N}$ be the set of training samples from a d-dimensional feature space, with N as the total number of samples, and let $c_i \in \{1, 2, \ldots, M\}$ indicate the class label of $x_i$, with a total of M classes. $TR_j = \{x_{ij} \in \mathbb{R}^d\}_{i=1}^{N_j}$ represents the subset of $TS$ corresponding to class $\omega_j$, with $N_j$ training samples. LMKNN follows these steps to classify the query sample $x \in \mathbb{R}^d$ into a class $\omega_c$: Step 1. Identify the set of k nearest neighbors $T_j^k(x)$ within each class $\omega_j$ for the query pattern $x$.
Step 2. Calculate the local mean vector $\bar{m}_j^k$ for class $\omega_j$, using the set $T_j^k(x)$ obtained in Step 1, as in (3).
Step 3. Assign $x$ to class $\omega_c$ when the Euclidean distance between the local mean vector of $\omega_c$ and the query sample is the minimum over all classes, using (4).
LMKNN is equivalent to the 1-NN classifier when k = 1. The significance of k differs between KNN and LMKNN: KNN selects the k nearest neighbors from the entire training sample set, while LMKNN uses the local mean vectors of the k nearest neighbors within each class. LMKNN aims to identify the class whose local region is closest to the query sample, effectively mitigating the negative impact of outliers, particularly in small sample sizes.
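The three steps above can be summarized in a short sketch. The plain-NumPy implementation below is an illustration of the local-mean rule under the assumption of Euclidean distance; it is not the authors' code.

```python
import numpy as np

def lmknn_predict(X_train, y_train, x_query, k=3):
    """Local Mean-based KNN: assign the query to the class whose local mean vector
    (mean of that class's k nearest training samples) lies closest to the query."""
    X_train, y_train = np.asarray(X_train, dtype=float), np.asarray(y_train)
    x_query = np.asarray(x_query, dtype=float)
    best_class, best_dist = None, np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]                      # training samples of class c
        d = np.linalg.norm(Xc - x_query, axis=1)        # Step 1: distances within class c
        nearest = Xc[np.argsort(d)[:min(k, len(Xc))]]   # k nearest neighbors of class c
        local_mean = nearest.mean(axis=0)               # Step 2: local mean vector
        dist = np.linalg.norm(local_mean - x_query)     # Step 3: distance from local mean to query
        if dist < best_dist:
            best_class, best_dist = c, dist
    return best_class
```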

C. Distance-Weighted k-Nearest Neighbors (DWKNN)
DWKNN, an extension of KNN, was designed to reduce the sensitivity to the neighborhood size parameter k while achieving good pattern classification performance. Let $T^k = \{(x_i^{NN}, c_i^{NN})\}_{i=1}^{k}$ be the set of k nearest neighbors of the query $\bar{x}$, sorted in ascending order of their distances $d(\bar{x}, x_i^{NN})$ between $\bar{x}$ and $x_i^{NN}$, and let $\bar{W} = \{\bar{w}_1, \ldots, \bar{w}_k\}$ be the corresponding set of dual weights. DWKNN builds upon WKNN by assigning different weights to the k nearest neighbors based on their distances, with higher weights given to nearer neighbors. It assigns to the i-th nearest neighbor $x_i^{NN}$ of the query $\bar{x}$ a dual weight $\bar{w}_i$, determined by a dual distance-weighted function as in (5). The classification of the query $\bar{x}$ is then determined through a majority weighted vote from the k nearest neighbors, as described by (6).
The dual weight calculation considers two components: the first part is similar to the weight in WKNN, while the second represents a newly determined weight, both based on the fundamental idea of the distance weighting scheme. The dual weights $\bar{w}_i$ in (5) are generally smaller than the weights computed in WKNN, except for the first and k-th nearest neighbors. Consequently, the corresponding neighbors $x_i^{NN}$ have a smaller influence on the query classification outcome. The dual weights decrease rapidly from 1 at the first (nearest) neighbor distance to 0 at the farthest, k-th nearest neighbor distance.
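As a hedged illustration, the dual distance-weighted function of [16] is commonly written as $\bar{w}_i = \frac{d_k - d_i}{d_k - d_1} \cdot \frac{d_k + d_1}{d_k + d_i}$, which equals 1 for the first neighbor and 0 for the k-th; the sketch below follows that form and falls back to equal weights when all k distances coincide. Euclidean distance and the function name are assumptions of this sketch.

```python
import numpy as np
from collections import defaultdict

def dwknn_predict(X_train, y_train, x_query, k=5):
    """Distance-Weighted KNN with dual weights that decay from 1 (nearest neighbor)
    to 0 (k-th neighbor); the query takes the class with the largest weighted vote."""
    X_train, y_train = np.asarray(X_train, dtype=float), np.asarray(y_train)
    d = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    idx = np.argsort(d)[:k]                       # k nearest neighbors, ascending distance
    d_sorted, labels = d[idx], y_train[idx]
    d1, dk = d_sorted[0], d_sorted[-1]
    if np.isclose(d1, dk):
        weights = np.ones_like(d_sorted)          # all neighbors equidistant: equal weights
    else:
        weights = ((dk - d_sorted) / (dk - d1)) * ((dk + d1) / (dk + d_sorted))
    scores = defaultdict(float)
    for w, c in zip(weights, labels):
        scores[c] += w                            # weighted majority vote over the k neighbors
    return max(scores, key=scores.get)
```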

A. Dataset Description
The umbilical cord, also known as the navel string, is a connecting tissue or channel establishing a link between the placenta and the fetus. It serves as a lifeline, fulfilling several crucial roles, including maintaining the viability and growth of the fetus, eliminating waste compounds, and transporting essential elements such as oxygen, nutrients, and antibodies. These factors collectively contribute to the optimal development of the fetus in the womb.
The umbilical cord dataset comprised 19 distinct features and three classes, namely normal, hypercoiling, and hypocoiling. The dataset exhibited an imbalance ratio (IR) of 6.3%, with a total of 63 data points, as shown in Table I. The testing phase was performed using a 70% training and 30% testing data split.
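A minimal sketch of the 70/30 split is shown below, assuming scikit-learn and a stratified split to preserve the class ratio; stratification, the random seed, and the per-class counts in the placeholder labels are assumptions of this sketch, not details reported in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the 63-sample, 19-feature umbilical cord data;
# the per-class counts below are illustrative only, not taken from Table I.
rng = np.random.default_rng(0)
X = rng.normal(size=(63, 19))
y = np.array(["normal"] * 45 + ["hypercoiling"] * 12 + ["hypocoiling"] * 6)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
```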

B. Performance Metrics
The performance of the classification model was assessed using four metrics, namely accuracy, precision, recall, and F-measure. In machine learning classification tasks, these metrics are derived from the confusion matrix parameters, namely True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). These parameters serve as the basis for computing the performance metrics. Accuracy measures the number of correctly classified data points relative to the ground-truth labels, divided by the total data used for testing. Precision is the rate of accurate predictions among all samples predicted to belong to the minority class, indicating the number of accurate positive predictions. Recall reflects the proportion of minority class samples correctly labeled as positive. Table II shows the formulas for measuring accuracy, precision, recall, and F-measure. The F-measure represents the weighted harmonic mean of precision and recall, governed by the value of β, which ranges from 0 to 1; a higher β value indicates that the evaluation prioritizes precision, and vice versa.
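For completeness, the confusion-matrix-based metrics take the standard forms below. Since Table II is not reproduced in the text, the weighted F-measure is written here in the harmonic-mean form consistent with a β that ranges from 0 to 1 and emphasizes precision as β grows; this exact parameterization is an assumption.

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```

```latex
F_{\beta} = \frac{1}{\beta \,\dfrac{1}{\mathrm{Precision}} + (1 - \beta)\,\dfrac{1}{\mathrm{Recall}}},
\qquad \beta \in [0, 1], \quad \beta = 0.5 \text{ recovers the balanced } F_1.
```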

C. Performance Analysis
The testing phase included the assessment of KNN, LMKNN, and DWKNN, utilizing three different distance metrics, namely Euclidean, Minkowski, and Manhattan. In the first testing phase, the KNN method was evaluated with varying values of k (k = 1, 3, 5, 7, 9). Table III shows the testing results for the KNN voting algorithm.
The results showed that the KNN voting model attained its highest accuracy of 98.37% when k = 3, surpassing the conventional KNN method for the Minkowski, Manhattan, and Euclidean distances. In terms of precision, the KNN voting method also demonstrated the highest achievement of 98.49% at k = 1 and k = 3. In the case of recall, all four models achieved the highest value of 98.34%. Similarly, for the F1 score, all four models attained a maximum result of 98.33%.

The second testing phase continued with the DWKNN voting model, in which the DWKNN (Euclidean), DWKNN (Manhattan), DWKNN (Minkowski), and DWKNN (Voting all) models were analyzed for optimal performance. Table IV shows the results of the second testing phase. In this phase, the DWKNN voting algorithm demonstrated performance relatively similar to the DWKNN algorithm with Manhattan distance. The highest accuracy was achieved at 98.52% for k = 2, a result matched by DWKNN Manhattan at k = 5 and k = 7, which also achieved a 98.52% accuracy rate. In terms of precision, the highest values were nearly identical for DWKNN Manhattan and DWKNN voting, both reaching 98.57% at k = 3. The recall values reached their peak at 98.48% for k = 3, 5, and 7. For the F1 score, DWKNN (Euclidean), DWKNN (Manhattan), and DWKNN (Voting all) all attained the same peak value of 98.36%.

The third testing phase proceeded with a comparison of the performance results for the Voting LMKNN, LMKNN (Euclidean), LMKNN (Manhattan), and LMKNN (Minkowski) methods. Table V presents the results of the third testing phase. In this test, the LMKNN voting algorithm consistently outperformed the other three methods. It achieved the highest accuracy of 98.86% at k = 5, the highest precision of 98.70% at k = 5, the peak recall of 98.80% at k = 3, and the highest F1 score of 98.63% at k = 5.

The tests showed that the proposed nearest neighbors voting algorithm could enhance performance. While the improvement might not be highly significant, the inclusion of voting enhanced the ability of the algorithm to identify a decision boundary, particularly under imbalanced data conditions. The average performance improvement in terms of accuracy, precision, recall, and F1 score fell within the range of 1-2%. The incorporation of voting based on different distance measurements in the nearest neighbors algorithm broadened the scope for the methods to make final class decisions for the test data.

IV. CONCLUSION
In conclusion, this study developed a novel approach to enhance the KNN algorithm by incorporating different distance metrics and implementing a voting scheme. The algorithms tested with the inclusion of voting based on distance metrics were KNN, LMKNN, and DWKNN. The evaluation was carried out on the umbilical cord dataset, characterized by its limited data volume and class imbalance, using a 70:30 split for training and testing data and varying k values for the observations. The experimental results showed that the proposed nearest neighbors voting method yielded an improvement of approximately 1% to 2% in terms of accuracy, precision, recall, and F1 score when compared to the non-voting approach. The integration of voting based on diverse distance measurements within the nearest neighbors algorithm provided a broader perspective for refining the process of making final class decisions for the test data. For future developments, this study could consider addressing the relatively high computational time associated with the proposed method. The increased computational demand results from the multiple distance metric calculations and the voting steps performed by the model for each test data point.

Fig. 2 to Fig. 5 show plots of the performance of the nearest neighbors voting model. The figures compare the performance of the KNN, LMKNN, and DWKNN methods and the proposed method for different values of the parameter k.

Fig. 3 Plot of the accuracy performance of the proposed method on the umbilical cord dataset

TABLE V. RESULTS OF ACCURACY, PRECISION, RECALL, AND F1-MEASURE MEASUREMENTS FOR THE LMKNN VOTING MODEL.