Implementation of Principal Component Analysis and Learning Vector Quantization for Classification of Food Nutrition Status

Balanced nutrition supports healthy child growth. During the COVID-19 pandemic, a balanced, nutritious diet helps support a child's immune system against the virus. To determine the nutritional adequacy of children's food during the pandemic, this study classifies food nutrition data by applying principal component analysis (PCA) for dimension reduction and learning vector quantization (LVQ) for classification. The data are Indonesian food nutritional values from the Ministry of Health of the Republic of Indonesia, comprising 1,146 records with 25 food nutrient indicators. In testing, the combined PCA-LVQ method produced an average accuracy of 58%, with a highest accuracy of 60%. The study also compares the performance of three dimension reduction methods, PCA, independent component analysis (ICA), and factor analysis (FA), in the LVQ classification process. Of the three methods, FA was the fastest at 4.10434 seconds, and PCA produced the highest accuracy at 58.2%.


I. INTRODUCTION
The COVID-19 (coronavirus) pandemic was the largest global health crisis of 2020, with an unprecedented death toll and socioeconomic impact [1]. According to the Head of the Family Health and Nutrition Section of the Bandung City Health Office, Dewi Primasari, the pandemic has had a significant impact on nutritional problems, especially toddler nutrition. As a result, nutritional problems increased to 5.33% in 2020 compared with the previous year. Balanced nutrition supports growth and development, especially in children. It can be obtained from food intake that meets the body's nutritional needs according to the age and activity of a child. Although no food can prevent COVID-19 infection [2], maintaining a balanced diet is very important for boosting the immune system during a pandemic.
To support community outreach and to educate children about balanced nutrition during a pandemic, a system is needed that helps nutritionists classify the nutritional adequacy of children's food. Several classification methods, such as Learning Vector Quantization [3], Fuzzy Logic [4], Naïve Bayes Classifier [5], and K-Means Clustering [6], have been proposed for nutritional status classification.
Food nutrition datasets have many indicators (attributes), including water, carbohydrates, protein, calories, fiber, iron, vitamins, and so on. The attributes, or data dimensions, must therefore be reduced to determine the main attributes or components of the dataset. Several dimension reduction methods, such as Principal Component Analysis [7], Independent Component Analysis [8], Factor Analysis [9], and Latent Semantic Analysis [10], have been proposed for reducing dataset dimensions.
This research applies the PCA method to reduce the attribute dimensions and determine the most important attributes in the data, and the LVQ method to classify the nutritional status of children's food based on those attributes, so that the classification results are more accurate. The study aims to measure the accuracy of the principal component analysis (PCA) and learning vector quantization (LVQ) methods in the nutritional classification of children's food during the pandemic.

II. METHOD
This research comprises several stages: correlation analysis, attribute reduction with PCA, design of the LVQ model, classification, and testing.

A. The Dataset
The dataset is based on Indonesian food nutritional value data from the Ministry of Health of the Republic of Indonesia in 2020. It contains 1,146 food records with 25 indicators: water, energy (calories), protein, fat, carbohydrates, fiber, ash, calcium, phosphorus, iron, sodium, potassium, copper, zinc, retinol (Vitamin A), beta-carotene, total carotene, thiamine (Vitamin B1), riboflavin (Vitamin B2), niacin, vitamin C, BDD (edible portion), type, group, and source. Two target classes of food nutritional status were defined, namely foods that meet and foods that do not meet nutritional needs during the pandemic, based on the Final Guide to Balanced Nutrition during the Covid-19 Period from the Ministry of Health of the Republic of Indonesia in 2020 [2]. The dataset was labeled through crowdsourcing: "0" represents food that does not fulfill nutrition in pandemic times, and "1" represents food that fulfills nutrition in pandemic times. The dataset is then split in a 75-25 ratio, where 75% is used as training data and 25% as test data. Fig. 1 shows the food composition dataset.
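The 75-25 split described above can be sketched as follows. This is a minimal illustration, not the study's code: the function name `split_75_25`, the placeholder rows, and the use of Python's `random` module are assumptions, and the seed of 24 only mirrors the random state value mentioned in Section III, so it is not expected to reproduce the study's actual partition.

```python
import random

def split_75_25(rows, test_ratio=0.25, seed=24):
    """Shuffle row indices and hold out `test_ratio` of the rows as test data."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(rows) * test_ratio)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test

# Hypothetical placeholder rows: (record id, label), not the real dataset.
rows = [(i, i % 2) for i in range(1146)]
train, test = split_75_25(rows)
print(len(train), len(test))
```

A simple index shuffle is used here for clarity; a stratified split would keep the class proportions equal in both partitions, which the text does not specify.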

B. Correlation Analysis Stage
In the correlation analysis process, the food composition dataset is used as input to form a correlation matrix that describes the relationship between each pair of nutrient indicators in children's food, using the Pearson product-moment correlation coefficient. The coefficient is interpreted with the following degrees of closeness [11]:
• Coefficient value 0 = no relationship at all.
• Coefficient value 1 = perfect relationship.
• Coefficient value > 0 to < 0.2 = very weak relationship.
• Coefficient value 0.4 to < 0.6 = fairly strong relationship.
• A negative value indicates a relationship in the opposite direction.
Coefficient values of -1 and 1 indicate perfect relationships, while values of 0 or close to 0 indicate no relationship between the two variables tested. The correlation coefficient is calculated with (1).
$$r_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \quad (1)$$
where
$r_{xy}$ = correlation coefficient between the x and y variables
$x$ = values of the horizontal axis in the coordinate plane
$y$ = values of the vertical axis in the coordinate plane
$\sum xy$ = sum of the products of the values of x and y
$x^2$ = square of the value x
$y^2$ = square of the value y
$n$ = number of data pairs
Table I gives the data for an example correlation calculation. By using label encoding, type "1" represents processed and type "0" represents raw. Group "4" represents fish, group "6" represents sugar, and group "9" represents vegetables. The correlation coefficient between the type attribute and the group attribute is calculated with equation (1).
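The calculation can be sketched in a few lines of Python. The function below uses the raw-score form of the Pearson coefficient, which is assumed here to be the form of equation (1). The `type_col` and `group_col` values are hypothetical label-encoded samples chosen for illustration, since the contents of Table I are not reproduced in this text.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation, raw-score form."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical label-encoded sample (NOT the values of Table I):
# type: 1 = processed, 0 = raw; group: 4 = fish, 6 = sugar, 9 = vegetables.
type_col = [1, 0, 0, 1, 0]
group_col = [4, 9, 6, 4, 9]
r = pearson_r(type_col, group_col)
print(round(r, 4))
```

The function raises a division error if either column is constant, because the denominator is then zero; a production version would need to handle that case.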

C. Attribute Reduction Stage
The dimensions, or attributes, of the food composition dataset are reduced after the correlation analysis stage. The PCA method is applied to the original attribute set to obtain a minimal set of attributes, and the attributes with the maximum ratio are then selected after the reduction [12].
The stages of PCA consist of calculating the mean value, normalizing the data, calculating the covariance, and determining the eigenvectors and eigenvalues [13]. The following are the stages of PCA using the sample dataset in Table II. By using label encoding, type "1" represents processed and type "0" represents raw. Group "4" represents fish, group "6" represents sugar, and group "9" represents vegetables.
The data are first normalized, for example the water value of the first record.
• Calculate the covariance
The covariance of a data population is given in (4).
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n} \quad (4)$$
The covariance of a data sample is given in (5).
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \quad (5)$$
where
$\mathrm{cov}(x, y)$ = covariance between the x and y variables
$x$ = variable x
$y$ = variable y
$\bar{x}$ = mean of the value x
$\bar{y}$ = mean of the value y
$n$ = number of data
Equation (5) is used for the example covariance calculation on the normalized type and group attributes.
• Suppose $A$ is a covariance matrix, $v$ is a vector, and $\lambda$ is a scalar that satisfies $Av = \lambda v$. Then $\lambda$ is called the eigenvalue associated with the eigenvector $v$ of $A$, as in (6).
$$\det(A - \lambda I) = 0 \quad (6)$$
The covariance matrix is reduced to the $\det(A - \lambda I)$ form of equation (6), and column reduction is then performed on the $\det(A - \lambda I)$ matrix. This gives the eigenvalues λ1 = 3.3894, λ2 = 0.588, λ3 = 0.0214, λ4 = 0.0012, and λ5 = 0.
• Determine the principal components (PCs)
At this stage, the eigenvalues (λ) are first sorted from the largest to the smallest value. Then, by the Kaiser-Guttman criterion for selecting PCs, eigenvalues greater than 1 should be retained. From the calculation results, three principal components (PCs) are obtained, namely PC1, PC2, and PC3, as the new dimension data (Table III).
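The covariance and selection steps can be sketched in plain Python. The sketch below assumes z-score normalization, which the text does not spell out, and uses hypothetical label-encoded columns rather than the values of Table II. The helper names are illustrative, and the eigenvalues are the five values quoted above.

```python
from math import sqrt

def zscore(column):
    """Normalize one attribute: (x - mean) / sample standard deviation (assumed)."""
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / std for x in column]

def sample_cov(x, y):
    """Sample covariance: sum((x - mean_x)(y - mean_y)) / (n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def covariance_matrix(columns):
    """Pairwise sample covariance of every attribute column."""
    return [[sample_cov(a, b) for b in columns] for a in columns]

def kaiser_guttman(eigenvalues, threshold=1.0):
    """Sort eigenvalues in descending order and keep those above the threshold."""
    return [ev for ev in sorted(eigenvalues, reverse=True) if ev > threshold]

def cumulative_variance(eigenvalues):
    """Cumulative share of total variance, largest eigenvalue first."""
    ranked = sorted(eigenvalues, reverse=True)
    total = sum(ranked)
    shares, running = [], 0.0
    for ev in ranked:
        running += ev
        shares.append(running / total)
    return shares

# Hypothetical label-encoded columns (NOT the values of Table II).
type_col = [1, 0, 0, 1, 0]
group_col = [4, 9, 6, 4, 9]
cov = covariance_matrix([zscore(type_col), zscore(group_col)])

# Eigenvalues quoted in the text.
eigenvalues = [3.3894, 0.588, 0.0214, 0.0012, 0.0]
kept = kaiser_guttman(eigenvalues)
cumulative = cumulative_variance(eigenvalues)
print(cov, kept, [round(c, 4) for c in cumulative])
```

In this sketch a strict threshold of 1 keeps only the first of the five quoted eigenvalues, while the first three together account for about 99.97% of the total variance. The three components retained in the text may therefore reflect a cumulative-variance reading of the criterion; that interpretation is an assumption of this note.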

D. Design of the LVQ model
The classification model is formed, and the classification process is carried out, with the LVQ network model. LVQ performs learning in the competition layer, as shown in Fig. 2.
Based on Fig. 2, X₁, X₂, ..., Xₙ form the input vector. The input vector is connected to the weight vectors W₁ and W₂. ‖X − W‖ is the distance between the input vector and a weight vector, which is passed to the activation functions F₁ and F₂. The activation function F₁ maps the output vector (y_in1) to class Y₁ = 1 if ‖X − W₁‖ < ‖X − W₂‖, and to Y₁ = 0 otherwise. Similarly, the activation function F₂ maps the output vector (y_in2) to class Y₂ = 1 if ‖X − W₂‖ < ‖X − W₁‖, and to Y₂ = 0 otherwise. Y₁ is the first-class output and Y₂ is the second-class output. The output class obtained from this competition layer depends on the distance between the input vector and the weight vectors [14].
The stages of LVQ consist of parameter initialization, calculating the Euclidean distance, updating the weights, and determining the optimal weights [15]. The following are the stages of LVQ using the principal components in Table IV.
• Initialize data and parameters
Set the initial weight values (w_ij), where i indexes the weight vector and j indexes the input variable. There are two target classes, 1 and 0, so the weights are initialized from the training data with W₁ = (2.06, -0.40, -0.21) as class 1 and W₂ = (-2.07, -0.32, -0.09) as class 0.

Fig. 2 LVQ architecture
The initial stage initializes the alpha value (learning rate), the alpha decrement, the minimum alpha, and the maximum epoch as inputs. The equations for determining the epoch and the alpha decrement are in (7) and (8).
• Update the weight
Update the weight of the class with the smallest Euclidean distance. The update depends on whether Cj, the output category or vector class for training, matches Cx, the category or class that corresponds to output unit j, as in (10).
$$w_j(\text{new}) = \begin{cases} w_j(\text{old}) + \alpha\,[x - w_j(\text{old})] & \text{if } C_x = C_j \\ w_j(\text{old}) - \alpha\,[x - w_j(\text{old})] & \text{if } C_x \neq C_j \end{cases} \quad (10)$$
• Determine the optimal weight value
After the weight update process is carried out, the optimal weights are obtained in the form of an optimal weight matrix (Table V).
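The training stages can be illustrated with a short sketch of the standard LVQ1 rule: the winning weight vector moves toward the input when the classes match and away from it otherwise. The initial weights and the learning rate of 0.1 come from the text. The training samples, the multiplicative alpha decay, and the values of `dec_alpha` and `min_alpha` are assumptions made for illustration, since equations (7) and (8) are not reproduced here.

```python
from math import sqrt

def euclidean(x, w):
    """Euclidean distance between an input vector and a weight vector."""
    return sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))

def train_lvq(samples, labels, weights, alpha=0.1, dec_alpha=0.1,
              min_alpha=1e-4, max_epoch=100):
    """LVQ1 training; `weights` maps each class label to its weight vector."""
    weights = {cls: list(w) for cls, w in weights.items()}  # work on a copy
    epoch = 0
    while epoch < max_epoch and alpha >= min_alpha:
        for x, target in zip(samples, labels):
            # Competition layer: the closest weight vector wins.
            winner = min(weights, key=lambda cls: euclidean(x, weights[cls]))
            # Attract the winner if its class matches the target, else repel it.
            sign = 1.0 if winner == target else -1.0
            weights[winner] = [w + sign * alpha * (xi - w)
                               for xi, w in zip(x, weights[winner])]
        alpha -= dec_alpha * alpha  # assumed decay schedule, not equation (8)
        epoch += 1
    return weights

# Initial weights as given in the text (class 1 and class 0).
initial_weights = {1: (2.06, -0.40, -0.21), 0: (-2.07, -0.32, -0.09)}

# Hypothetical three-component training vectors (NOT the values of Table IV).
samples = [(1.8, -0.2, 0.1), (2.3, 0.1, -0.3), (-1.9, 0.4, 0.0), (-2.4, -0.5, 0.2)]
labels = [1, 1, 0, 0]

optimal_weights = train_lvq(samples, labels, initial_weights)
print(optimal_weights)
```

With these settings the loop stops before 100 epochs because alpha falls below `min_alpha` first; which stopping condition applies in practice depends on the schedule actually used.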

E. Classification Stage
The optimal weight matrix is used as the set of weight vectors for classifying an input vector. The input vector is initialized as x, and the Euclidean distance between x and each weight vector of the optimal weight matrix is calculated. If the distance to W₁ is the smaller of the two, the input vector is closest to class "1" and is classified as class "1". Similarly, if the distance to W₂ is the smaller, the input vector is closest to class "0" and is classified as class "0".
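The decision rule amounts to a nearest-weight lookup, sketched below. The weight values are stand-ins: the optimal weights of Table V are not reproduced in this text, so the initial weights from Section II-D are reused purely for illustration, and the function name `classify` is an assumption.

```python
from math import sqrt

def classify(x, weight_vectors):
    """Return the class whose weight vector is closest to x (Euclidean distance)."""
    def distance(w):
        return sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))
    return min(weight_vectors, key=lambda cls: distance(weight_vectors[cls]))

# Stand-in weights (the initial weights from the text, NOT the values of Table V).
weight_vectors = {1: (2.06, -0.40, -0.21), 0: (-2.07, -0.32, -0.09)}

# Hypothetical input vectors in principal-component space.
print(classify((1.5, 0.0, 0.1), weight_vectors))
print(classify((-1.2, 0.3, 0.0), weight_vectors))
```

If an input is exactly equidistant from both weight vectors, this sketch returns whichever class appears first in the dictionary; the text does not state a tie-breaking rule.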

F. The Block Diagrams
The block diagram describes the system process flow from input to output, as shown in Fig. 3. In Fig. 3, the dataset enters the system as training data, and correlation analysis is performed between the attributes. Correlated attributes then go through the reduction stage using the PCA method. The reduced dataset is used for the classification process, in which each data record is initialized as a vector. Each vector is presented to the competition layer of the LVQ architecture model. The initial weights, epoch, and learning rate (α) are then initialized, and the weight vectors are updated by calculating the Euclidean distance. The classification results obtained in the training process are stored for the nutritional classification of children's food in the testing process. The output of the system is the classification of the nutritional status of children's food as sufficient or insufficient nutrition during the pandemic.

G. Testing
At this stage, after PCA preprocessing and LVQ classification, the system outputs the predicted nutritional status of children's food. The performance of the classification model is tested on the test data, which make up 25% of the entire dataset. Model performance is measured using accuracy, precision, recall, and the F-measure or F1 score.
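The four measures follow directly from the confusion-matrix counts, as the sketch below shows. The TP and FP counts (119 and 100 out of 292 test data) are the post-reduction figures reported in Section III-D. The FN and TN counts are not stated in the text; they are back-calculated here from the reported recall and accuracy and should be read as an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# tp and fp: reported in Section III-D for the PCA-LVQ model.
# fn and tn: ASSUMED, back-calculated so that the counts sum to 292 test data.
metrics = classification_metrics(tp=119, fp=100, fn=22, tn=51)
print({name: round(value, 3) for name, value in metrics.items()})
```

With these counts the rounded values agree with the accuracy of 0.582, precision of 0.543, recall of 0.844, and F1 score of 0.661 reported for PCA in Section III-E.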

III. RESULTS AND DISCUSSION
The study aims to measure the accuracy of models that classify food as nutritionally sufficient or insufficient in pandemic times by implementing the PCA and LVQ methods. The data are divided in a ratio of 75% training data and 25% test data. The model was built using 100 epochs, an alpha value of 0.1, a random state value of 24, and 2 initial weights representing the 2 target classes.

A. Correlation Analysis
Correlation analysis was conducted to determine the relationship between each nutrient indicator in children's food using the Pearson product-moment correlation coefficient (Fig. 4).

Fig. 4 Correlation matrix
Based on the correlation analysis results in Fig. 4 and the degrees of closeness of the Pearson product-moment correlation coefficient, the correlation of each nutritional indicator with itself is 1, meaning a perfect relationship. The correlation between the calorie and fat indicators is 0.77, meaning that the higher the calorie content, the higher the fat content, with a degree of closeness of 0.77 or a strong relationship. The correlation between the protein and phosphorus indicators is 0.54, meaning that the higher the protein content, the higher the phosphorus content, with a degree of closeness of 0.54 or a fairly strong relationship. The correlation between the water and calorie indicators is -0.91, meaning that the higher the water content, the lower the calorie content, with a degree of closeness of 0.91 or a very strong relationship. The correlation between the water and carbohydrate indicators is -0.72, meaning that the higher the water content, the lower the carbohydrate content, with a degree of closeness of 0.72 or a strong relationship.

B. Attribute Reduction
Dataset attribute reduction was performed using the PCA method. The most influential attributes were ranked from the obtained eigenvalues. The ranking is based on the eigenvalues, from the largest to the smallest, and on the cumulative variance value. Table VI shows the ranking of each attribute based on the eigenvalues and cumulative values.
Based on Table VI, of the 25 indicators, the 10 with the largest eigenvalues represent the 10 selected principal components, following the Kaiser-Guttman rule that components with eigenvalues greater than 1 should be retained. These are type, group, water, fat, calcium, phosphorus, potassium, copper, zinc, and vitamin A. Fig. 5 shows the obtained principal components.

C. Model Training
At this stage, the model is trained with the learning vector quantization network architecture. The training parameters are initialized with a learning rate (α) of 0.1 and a total of 100 epochs. The model is trained both on features obtained from PCA reduction and on features without PCA reduction. With PCA reduction, 10 features or principal components are used, while without PCA the 25 initial features or attributes are used. Training yields the optimal weight matrix, obtained by updating the weight vectors against the values of the principal components (input vectors) in the competition layer. This matrix is then used to determine the classification results at the testing stage. The optimal weight matrix based on the principal components is shown in Fig. 6.
Table VII shows the experiment that classifies the nutritional status of food on the training data with the trained model, while Table VIII shows the experiment that tests the performance of the trained model based on the number of test data.

D. LVQ and LVQ-PCA Performance Testing
Table IX compares the performance of the LVQ classification model before and after the reduction with PCA.
The test results show that, with the PCA dimension reduction method applied, the LVQ model classifies the nutrition of children's food better than before the reduction. Before the PCA reduction, of the 292 test data there were 9 TP (test data with sufficient nutrition that were predicted as sufficient) and 10 FP (test data with insufficient nutrition that were predicted as sufficient). After the reduction, there were 119 TP and 100 FP.

E. Performance Testing of Dimension Reduction Method
The Independent Component Analysis (ICA) and Factor Analysis (FA) dimension reduction methods were also tested on the same dataset. The ICA method shows an accuracy of 0.503, a precision of 0.464, a recall of 0.636, an F1 score of 0.537, and a processing time of 5 seconds. The FA method shows an accuracy of 0.517, a precision of 0.469, a recall of 0.508, an F1 score of 0.487, and a processing time of 4.1 seconds. Table X compares the performance of the PCA, ICA, and FA dimension reduction methods.
The test results show that the PCA dimension reduction method performs better than ICA and FA in the nutritional classification of children's food during the pandemic. The PCA method shows an accuracy of 0.582, a precision of 0.543, a recall of 0.844, an F1 score of 0.661, and a processing time of 4.1 seconds.

IV. CONCLUSION
Based on the results of the tests that have been carried out, implementing the PCA dimension reduction method in the LVQ classification model can classify the nutritional status of children's food in the pandemic. In the comparison of reduction methods, PCA works better than the ICA and FA dimension reduction methods: PCA reduces the 25 nutritional attributes of food to 10 principal components, with an accuracy of 58% and a processing time of 4.1 seconds for the classification model. There are several suggestions for improving the performance of the system in the future. Previous research indicates that the LVQ method performs better when fewer attributes are used. Future research that applies the LVQ method should use more attributes or use classification methods other than the LVQ method.