NS-SVM: Bolstering Chicken Egg Harvesting Prediction with Normalization and Standardization



I. INTRODUCTION
Several studies in computer science have helped increase the production of chicken eggs. Reference [1] proposed a machine that sorts chicken eggs by quality using light sensors, weight sensors, and fuzzy logic. Reference [2] built a chicken egg incubator connected to the Internet of things (IoT) and supplied with solar energy. In machine learning, [3] predicted the price of chicken eggs using dynamic time warping (DTW) to make investing easier for chicken egg businesspeople. Reference [4] used a support vector machine (SVM) to predict the category of chicken eggs based on a combination of images and sensor metrics. However, although the results are adequate, the research is limited to a small number of datasets.
Measurements from various sensors sometimes produce features whose ranges differ significantly. For example, [5] used a humidity sensor ranging from 0 to 100. That range is in stark contrast to the lux sensor, which ranges from 0 to approximately 10000. Another example is [6], where each axis of a three-axis accelerometer has a different range. Such diversity in feature ranges can degrade the prediction model's performance [7]. To avoid this decline, pre-processing techniques such as normalization and standardization can improve the prediction model's performance [8].
This paper proposes the normalization and standardization-bolstered support vector machine (NS-SVM), a method that applies normalization and standardization to improve the prediction of chicken egg harvest using SVM. First, we obtain a chicken egg dataset from Africa through Kaggle. The dataset has up to 13 features. Then we apply standard pre-processing such as label encoding and random oversampling. We also review the features of the dataset using the Pearson correlation coefficient (PCC). We use two SVM kernels: radial basis function (RBF) and 2nd-degree polynomial. Then we train the same models again after applying normalization and standardization. We use cross-validation with K = 10 to measure the accuracy of the compared models.
Many methods have implemented standardization and normalization to bolster their model's performance. However, to the best of our knowledge, no study has applied normalization and standardization to improve the predictive performance of chicken egg harvest using SVM. Our research contributions are as follows:
1. A model for chicken egg harvesting prediction that performs better by adding ten features of chicken characteristics.
2. A chicken egg harvesting prediction with a model bolstered by normalization and standardization, namely NS-SVM.
3. A report that shows the positive effect of normalization and standardization on the model for chicken egg harvesting prediction using 2nd-degree polynomial SVM.
The remainder of this paper is organized as follows: Section II discusses system design. Section III reports the test results and compares them with state-of-the-art papers on chicken egg harvesting prediction. In addition, the discussion subsection also highlights the contribution of our research. Finally, Section IV presents the conclusions of the study.
Like ours, several studies have used features related to chickens' environment. Omomule et al. [9] used PCC to rank several features and produced four features: chicken age, weight, quality, and quantity. The study used a fuzzy method to predict the amount of egg production based on these four features. However, they do not explain the potential increase in model performance if more features are added to the model. There is a research opportunity to analyze whether feature selection with PCC has a significant effect on the prediction performance of the model.
We propose SVM as a classification model. Some studies use other models besides SVM. Gonzalez-Mora et al. [10] used random forest (RF) as a classification model for predicting egg production. That study uses the air quality index (AQI) as its feature. Its results show that the value of $r^2$ for RBF is 0.78. Comparing RF with AQI and SVM with features of the chicken environment is a research opportunity.
Several studies have observed the use of standardization and normalization to improve the predictive ability of a model. For example, Raju et al. [11] compared 12 different types of scaling, including normalization and standardization, on three classification models: RBF SVM, sigmoid SVM, and KNN SVM. The performance of all three models is determined by the distance between the data. The case study used diabetes data. The results show that the models that have gone through the 12 types of scaling perform better than the models with raw data. There is a research opportunity to also compare the 2nd-degree polynomial SVM in the case study of chicken egg harvesting prediction.

II. METHOD
Fig. 1 shows our proposed research methodology. First, we obtain and observe the chicken egg dataset. Then we design a predictive model for chicken egg harvesting. The next step is to evaluate the model. After that, we repeat the process after applying normalization and standardization to the data. The next step is to evaluate the resulting predictive model. Finally, we report the findings of this research.

A. Chicken Egg Harvesting Data and Pre-Processing
The dataset we use is an egg-producing chickens' dataset from Kaggle. The dataset contains attributes and observations of 19 egg-producing chickens, with 1000 data items in the chicken and chicken egg dataset. Researchers may use the dataset without permission for machine learning and data science research purposes. Here we use it to predict the number of eggs to be harvested.
Furthermore, we explain the features contained in the dataset. Fig. 2 shows an explanation of each feature. Thirteen features include the characteristics of chickens, such as weight, age, and color of the body parts. In addition, there are also egg characteristics such as weight and color. Then there are other explanations, such as the amount of feed and the chickens' sunlight exposure. The label is the number of eggs a chicken produces in one day. The label range is 0 to 1 egg.
In addition to normalization and standardization, we also implement several other pre-processing stages. Since we are using SVM, we need to implement a label encoder, which is a process that converts a string to a unique integer [12]. Then we apply PCC to evaluate the features present in the dataset [13]. The PCC formula (denoted by $r$) is (1).
$$r = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sqrt{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}\,\sqrt{\sum_{i=1}^{n} (x_{2i} - \bar{x}_2)^2}} \tag{1}$$
where $x_{1i}$ is the first variable with data item $i$, $n$ is the dataset size, $\bar{x}_1$ is the average of the first variable, $x_{2i}$ is the second variable with data item $i$, and $\bar{x}_2$ is the average of the second variable.
Furthermore, we apply random oversampling [14], which addresses imbalanced data. Data imbalance occurs when the output labels are represented in unequal proportions. There are three degrees of imbalance: mild if the minority label has a proportion of 20 to 40%, moderate if the minority label has a proportion of 1 to 20%, and extreme if the minority label has a proportion below 1%. Random oversampling works by adding randomly selected minority class data until the class proportion is no longer imbalanced. Imbalanced data can affect the prediction model's performance because the model tends to choose the majority class.
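The random oversampling step and the three imbalance degrees can be illustrated with a minimal sketch. It assumes plain Python lists rather than the authors' actual tooling, and the function names, the placeholder rows, and the handling of the 20% boundary are assumptions made here. The class counts (18 and 482) are the training counts reported in Section III.

```python
import random
from collections import Counter

def imbalance_degree(minority_share):
    """Map the minority-class share (0..1) to the degrees described in the text.

    Assumption: a share of exactly 20% is treated as mild.
    """
    if minority_share < 0.01:
        return "extreme"
    if minority_share < 0.20:
        return "moderate"
    if minority_share <= 0.40:
        return "mild"
    return "balanced"

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until each class matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_rows, out_labels

# Class counts taken from the reported training set; the rows are placeholders.
labels = [0] * 18 + [1] * 482
rows = [[float(i)] for i in range(len(labels))]

share_before = min(Counter(labels).values()) / len(labels)
new_rows, new_labels = random_oversample(rows, labels)
counts_after = Counter(new_labels)
```

With these counts the minority share is 18/500 = 3.6%, which falls in the moderate band, and oversampling brings both classes to 482 items each.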

B. NS-SVM Prediction
We use SVM to predict chicken egg harvesting. SVM is a binary classifier that assigns each data item to one of two classes [15]. SVM creates a hyperplane that separates the two classes so that the distance between the two classes is maximized [16]. SVM has a kernel, $K(x_1, x_2)$, a function that forms the hyperplane [17].
Here we compare the RBF kernel and the 2nd-degree polynomial kernel. The RBF kernel is popular because it separates data with a Gaussian distribution. It works similarly to the k-nearest neighbor (KNN) method [18]. The RBF kernel formula is (2).
$$K(x_1, x_2) = \exp\left(-\frac{\lVert x_1 - x_2 \rVert^2}{2\sigma^2}\right) \tag{2}$$
where $x_1$ is the first feature, $x_2$ is the second feature, and $\sigma^2$ is the variance. Moreover, whereas the linear kernel separates two classes with a linear hyperplane, the polynomial kernel separates them with a polynomial hyperplane [19]. The 2nd-degree polynomial function is a quadratic function, whose characteristic shape is a parabola. The 2nd-degree polynomial kernel formula is (3).
$$K(x_1, x_2) = (x_1^{\top} x_2 + c)^2 \tag{3}$$
where $c$ is the free parameter and $c \geq 0$.
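To make the two kernels concrete, the following sketch evaluates them on a pair of small vectors. It assumes the common Gaussian form exp(-||x1 - x2||^2 / (2σ^2)) for the RBF kernel and (x1·x2 + c)^2 for the 2nd-degree polynomial kernel; the function names, default parameter values, and sample vectors are illustrative and do not come from the paper.

```python
import math

def rbf_kernel(x1, x2, sigma=1.0):
    """Gaussian RBF kernel: similarity decays with squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def poly2_kernel(x1, x2, c=1.0):
    """2nd-degree polynomial kernel with free parameter c >= 0."""
    dot = sum(a * b for a, b in zip(x1, x2))
    return (dot + c) ** 2

# Hypothetical feature vectors
u = [1.0, 2.0]
v = [2.0, 0.0]

k_rbf_same = rbf_kernel(u, u)       # identical points give the maximum similarity
k_rbf = rbf_kernel(u, v)            # squared distance is 5, so exp(-2.5)
k_poly = poly2_kernel(u, v, c=1.0)  # dot product is 2, so (2 + 1)^2
```

Because both kernels depend on distances or dot products between feature vectors, a feature with a much larger range than the others dominates the kernel value, which is the motivation for scaling the features first.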
Normalization in data pre-processing rescales every feature to the range 0 to 1 [20]. A model can use normalization for numerical parameters that have varying scales [21]. Normalization is useful because differences in scale affect how the prediction algorithm builds the model.
Standardization is a feature scaling technique that makes the average of a dataset 0 [22]. In some cases, applying standardization before performing machine learning training can improve the performance of the training model and increase the training speed [20]. Another result of standardization is that the effect of outliers also decreases. The normalization and standardization formula is (4).
$$x_{ns} = \frac{\dfrac{x - x_{min}}{x_{max} - x_{min}} - \mu}{\sigma} \tag{4}$$
where $x$ is the feature of a dataset, $x_{min}$ is the lowest value in the feature, $x_{max}$ is the highest value in the feature, $\mu$ is the mean of a feature, $\sigma$ is the standard deviation of a feature, and finally, $x_{ns}$ is the standard form of the feature's normal form. By applying normalization and standardization, the dataset features will have an average of 0 and comparable ranges centered on 0.
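The two scaling steps can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: it applies min-max normalization first and then standardizes the normalized values using their own mean and population standard deviation. The function names and the sensor readings are hypothetical.

```python
import statistics

def normalize(values):
    """Min-max normalization: rescale values to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature, nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: subtract the mean and divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def normalize_then_standardize(values):
    """Apply normalization, then standardization, to one feature column."""
    return standardize(normalize(values))

# Hypothetical readings on very different scales
humidity = [10.0, 35.0, 60.0, 100.0]
lux = [50.0, 2500.0, 7000.0, 10000.0]

humidity_ns = normalize_then_standardize(humidity)
lux_ns = normalize_then_standardize(lux)
```

After both steps each column has a mean of 0 and a standard deviation of 1, so the humidity and lux features end up on comparable scales even though their raw ranges differ by two orders of magnitude.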

C. Performance Metrics
The performance measurement of our model uses K-fold cross-validation, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC). Sometimes a model has a generalization problem, where generalization is the ability of a predictive model to perform equally well on each sub-sample of the dataset [23]. Therefore, cross-validation tests the model [24]. Fig. 3 shows the K-fold cross-validation algorithm. K-fold cross-validation runs K test iterations [25]. The algorithm divides the data into K equal parts.
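The partitioning behind K-fold cross-validation can be sketched with index lists. The helper name `k_fold_indices` and the dataset size of 1000 are assumptions for illustration, and the sketch does not shuffle the data, which a practical implementation would typically do first.

```python
def k_fold_indices(n_items, k):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one item."""
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        # Everything outside the validation fold is used for training
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, val_idx
        start += size

# K = 10 as in the paper; 1000 items is an illustrative size
folds = list(k_fold_indices(n_items=1000, k=10))
all_val = sorted(i for _, val_idx in folds for i in val_idx)
```

Each item appears in exactly one validation fold, so every data item is used for validation once and for training K - 1 times.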
In each iteration $k$, the K-fold cross-validation algorithm makes one of the $K$ parts the validation set, while the rest is for training. The validation uses accuracy ($Acc_k$). Here is the $Acc_k$ formula (5).
$$Acc_k = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$
where $TP$ is a true positive, $TN$ is a true negative, $FP$ is a false positive, and $FN$ is a false negative. There will be $K$ values of $Acc_k$. The final step is to aggregate these values using (6).
$$Acc = \frac{1}{K} \sum_{k=1}^{K} Acc_k \tag{6}$$
where $Acc$ is aggregate accuracy. The ROC and $AUC$ measurement methods for imbalanced data are more precise than the accuracy score because they are scale-invariant [26]. ROC is a curve that observes the relationship between true positive rate ($TPR$) and false positive rate ($FPR$) [24]. When the positive prediction threshold decreases, both $TPR$ and $FPR$ increase, which is observable in the ROC. $AUC$ is the area under the ROC curve. The greater the proportion of $TPR$ compared to $FPR$ at each point, the better the $AUC$ value. Here is our formula for measuring $AUC$ (7).
$$AUC = \sum_{i=1}^{n-1} \frac{(FPR_{i+1} - FPR_i)(TPR_{i+1} + TPR_i)}{2} \tag{7}$$
where $n$ is the number of $TPR$ and $FPR$ measurements in the probability prediction class.
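The metrics in this subsection can be worked through on a small example. The sketch below assumes that the aggregate accuracy is the mean of the per-fold accuracies and that the AUC is computed with the trapezoidal rule over the ROC points; the function names, confusion-matrix counts, scores, and labels are all hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Share of correct predictions in one validation fold."""
    return (tp + tn) / (tp + tn + fp + fn)

def mean_accuracy(fold_accuracies):
    """Aggregate per-fold accuracies by averaging them (assumed aggregation)."""
    return sum(fold_accuracies) / len(fold_accuracies)

def roc_points(scores, labels):
    """Lower the threshold step by step and collect (FPR, TPR) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Items with tied scores cross the threshold together
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical predicted probabilities and true labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

fold_acc = accuracy(tp=45, tn=50, fp=3, fn=2)
overall_acc = mean_accuracy([0.9, 1.0])
points = roc_points(scores, labels)
auc = auc_trapezoid(points)
```

In this example eight of the nine positive-negative pairs are ranked correctly, so the trapezoidal area equals 8/9. Lowering the threshold only ever adds predicted positives, which is why both rates move from 0 toward 1 along the curve.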

III. RESULT AND DISCUSSION
First, we measure the PCC score of all the features in the chicken egg dataset. Fig. 4 contains the PCC scores that measure the correlation between each feature and the output, namely EggsPerDay. No feature has a strong or moderate correlation. The feature with the strongest correlation is Age, which falls in the weak negative category. Two features that have a very weak correlation are AmountOfFeed and GallusWeight. The AmountOfFeed category is a weak negative, while the GallusWeight category is a weak positive. The remaining features, which have a magnitude below 0.1, do not correlate with EggsPerDay.
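Ranking features by the magnitude of their PCC against the label can be sketched as below. The feature names follow the dataset, but the numeric values are toy data invented for illustration and do not reproduce the weak correlations reported in Fig. 4; the `pearson` helper is likewise an assumption, not the authors' code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / (spread_x * spread_y)

# Toy values only; not taken from the Kaggle dataset
features = {
    "Age": [1.0, 2.0, 3.0, 4.0],
    "AmountOfFeed": [5.0, 5.0, 6.0, 5.0],
    "GallusWeight": [2.0, 2.1, 1.9, 2.0],
}
eggs_per_day = [1, 1, 0, 0]

scores = {name: pearson(col, eggs_per_day) for name, col in features.items()}
# Rank by absolute value so strong negative correlations are not overlooked
ranking = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
```

Sorting on the absolute value matters here because the sign only indicates the direction of the relationship, while the magnitude indicates its strength.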
We prepared two datasets. The first is a raw dataset, while the second is a dataset to which we apply normalization and standardization. Fig. 5 shows the effect of normalization and standardization on each feature's value distribution. Before normalization and standardization, each feature has a diverse distribution, with a maximum population of 7. The maximum range is 0 to 200. After normalization and standardization, all features have an average value of 0, and the maximum range is -5 to 5. There is no range disparity. The maximum population size is 1.4.
The next step is to prepare the dataset for training. Our train-test ratio is 50:50. We randomize the dataset by stratification, which maintains the label ratio in the train and test data. Before training, we check the training dataset for imbalance. Our label, EggsPerDay, has a value range of 0 to 1, meaning there are two classes. Table I summarizes the properties.
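A stratified 50:50 split can be sketched as follows. The helper `stratified_split` and the full-dataset class counts (36 and 964) are hypothetical; the counts were chosen here only so that half of each class matches the training counts of 18 and 482 reported in this section.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_ratio=0.5, seed=0):
    """Split indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # randomize within each class
        cut = round(len(idxs) * train_ratio)
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical label column for the full 1000-item dataset
labels = [0] * 36 + [1] * 964

train_idx, test_idx = stratified_split(labels, train_ratio=0.5)
train_zero = sum(1 for i in train_idx if labels[i] == 0)
test_zero = sum(1 for i in test_idx if labels[i] == 0)
```

Splitting per class rather than over the whole dataset guarantees that the rare label is present in both halves, which a purely random split cannot promise when the minority class is this small.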
In the raw training data, 18 data items have label 0, while 482 have label 1. The minority label proportion is 3.6% of the dataset, so our training dataset has a moderate imbalance degree. After we apply random oversampling, the proportion of minority labels becomes 50%, with no imbalance in the random oversampled training data. Fig. 6 compares training datasets before and after applying random oversampling.
We trained and compared two SVM models, one with the RBF kernel and the other with the 2nd-degree polynomial kernel. Each model trains with two different datasets, namely the raw dataset and the normalized and standardized dataset, the latter of which we named NS-SVM. This test compares a total of four SVM models. We used K-fold cross-validation to test all four models, with a value of K = 10. The test metric is accuracy. Fig. 7 shows the results of the comparison. Of the four models, NS-SVM has a higher performance than SVM for each kernel. The results indicate that NS-SVM improves the performance of chicken egg harvesting prediction. In addition, the 2nd-degree polynomial kernel performs better than RBF, which suggests that it suits the characteristics of the dataset. The model with the highest performance is NS-SVM with a 2nd-degree kernel, Acc = 0.996. At the same time, SVM with RBF kernel is the model with the lowest performance, namely Acc = 0.986.
Finally, we conduct ROC and AUC testing. We conduct this test because our dataset has a moderate degree of imbalance. Fig. 8 compares the ROC curves of the four SVM models. To interpret the curve, we use the AUC value. The curve with the highest AUC value is the 2nd-degree polynomial NS-SVM, which is 0.993. The second highest AUC value is NS-SVM RBF, with a value of 0.992. The third and fourth highest values are RBF SVM and 2nd-degree polynomial SVM, with values of 0.975 and 0.927, respectively.
We measure the PCC of 13 features in chicken egg harvesting prediction, where three have very weak correlations, namely weight, age, and feed quantity. Reference [9] also performs chicken egg harvesting prediction. The study stated that there were 12 features in chicken egg prediction but chose the same three features as our research. Although the study also used PCC, it did not mention the PCC results of the other features or did not measure them. In our research, if we add ten other features to the prediction, the performance can increase from 0.96 to 0.99. Our research contributes a chicken egg harvesting prediction that performs better by adding ten features of chicken characteristics.
Several other studies, such as [10], used an AQI dataset from chicken coops and used RF to predict chicken egg production, but the prediction performance was not as good as ours. Our research contributes to chicken egg harvesting prediction with an improved model, NS-SVM.
Several studies, such as [11], have shown that standardization and normalization influence predictive models that use the distance between data as the training algorithm. These prediction models include RBF SVM and KNN. In this study, apart from RBF SVM, we also show that normalization and standardization positively influence the prediction model of 2nd-degree polynomial SVM. Our research contributes a report that shows the positive effect of normalization and standardization on the chicken egg harvesting prediction using 2nd-degree polynomial SVM.

IV. CONCLUSION
We have implemented the proposed model, NS-SVM, which can improve the performance of chicken egg harvesting prediction. We use a Kaggle dataset called Egg Producing Chickens Dataset, which consists of 13 features and one label. We compare two SVM kernels, the RBF and the 2nd-degree polynomial kernel. K-fold cross-validation and ROC AUC curve analysis are the test metrics we use. The results showed that normalization and standardization positively affect the prediction model of the two SVM kernels. The model with the highest performance is NS-SVM with a 2nd-degree kernel, where Acc = 0.996. At the same time, the model with the lowest performance is SVM with RBF kernel, with an Acc of 0.986. In addition, the ROC AUC analysis results show that our model's performance on the moderately imbalanced dataset is AUC = 0.927 to 0.993.

Fig. 6 Comparison of the training dataset before and after random oversampling