K-Means Clustering for Grouping Rivers in DIY based on Water Quality Parameters

The Special Region of Yogyakarta (DIY) has rivers that cross rural and urban areas and are still used by the community and industry. However, river water pollution was a major issue in DIY in 2021. It is very important to classify rivers according to class so that further analysis and action can be carried out. This study conducted a grouping analysis of rivers in DIY based on water quality parameters, namely Total Suspended Solid (TSS), Dissolved Oxygen (DO), Biological Oxygen Demand (BOD), Chemical Oxygen Demand (COD), Phosphate, Fecal Coli, and Total Coliform. The grouping method is the K-means algorithm. The data source is secondary data from the DIY Provincial Environment and Forestry Service, consisting of 56 river samples observed in November 2020. The data description shows that the averages over the 56 river water samples are 24.95 for TSS, 8.84 for DO, 4.33 for BOD5, 20.36 for COD, 0.54 for Phosphate, 22,820 for Fecal Coli, and 59,210 for Total Coliform. The grouping with k = 6 is the best compared with k = 2, 3, 4, 5, 7, and 8. The number of members in each cluster is n1 = 14, n2 = 1, n3 = 1, n4 = 5, n5 = 18, and n6 = 17. The cluster with the highest average TSS, BOD, and COD values is the 3rd cluster (rivers in Bantul and Sleman Regencies). The cluster with the highest DO value is the 6th cluster (rivers in Bantul Regency). The cluster with the highest average Phosphate value is the 2nd cluster (rivers in Bantul, Sleman, and Gunungkidul Regencies). The cluster with the highest average Fecal Coli and Total Coliform values is the 4th cluster (rivers in Bantul Regency, Yogyakarta City, and Sleman Regency).


I. INTRODUCTION
The Special Region of Yogyakarta (DIY) Environment and Forestry Service reported that river water pollution was one of 17 environmental issues in DIY in 2021. It is also one of the three priority issues for improving environmental quality in DIY, together with waste and land conversion that does not comply with spatial planning.
Water resources are natural resources that are essential to all living things. Water is used in many aspects of life, such as household activities and drinking. Water resources are divided into two types, namely surface water and groundwater. The surface water most often used by humans is river water, while the groundwater most often used is well water. All of these water sources must be protected so that living things can survive and reproduce.
DIY has several rivers that flow through urban and rural areas. Many factors affect river water quality, including population growth, human activities, and industry. Population growth has led to an increase in settlements in river basins, which makes river water quality more difficult to control and leaves the management of domestic waste entering rivers less than optimal. The Central Statistics Agency states that the average population growth rate in DIY in 2020 was 1.01%. The highest population density, in the city of Yogyakarta, is 13,413 people/km² [1].
Based on calculations, 10 rivers are in a polluted condition. Fecal coliform and total coliform contribute most as the contaminants that cause the low value of the pollution index. The high coliform counts indicate that domestic waste has not been managed properly.
Given the role of river water quality in protecting ecosystems and human life, it is necessary to analyze river water quality. Each river in DIY has different quality and pollutant characteristics. Therefore, a location grouping analysis is needed to identify which locations have a high potential for water pollution. Reference [2] evaluated river water quality using the hierarchical clustering method and also stated that it is important to classify rivers according to their class so that further analysis and action can be carried out. Reference [3] also used clustering, which is useful for obtaining water quality ratings, classifying water quality distribution characteristics, identifying variations in pollutant characteristics by location and time, and finding short-term pollutant conditions.
Many grouping methods can be used, such as Fuzzy C-Means [4], Multi-Layer Perceptron [5], ANFIS [6]-[7], Naive Bayes [8], and others. This study uses clustering with the K-means algorithm. K-means is an unsupervised machine-learning method for grouping observations based on defined characteristics. It partitions the data into one or more groups so that data with the same characteristics fall into the same group [9].
K-means is a cluster analysis method in which k is the number of clusters. According to [10], K-means is an easy algorithm to run and implement, because it can group large amounts of data with relatively fast and efficient computation and is adaptable. The idea of K-means is to minimize the variation within each cluster, that is, the distance between each observation and the center point of its cluster should be as small as possible. If a cluster still shows relatively large variation, it can be split into two clusters.
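The minimum-variation idea described above is commonly written as a within-cluster sum of squares objective. The notation here is assumed for illustration and does not come from the original text: C_j is the j-th cluster, \mu_j its center point, and x_i an observation.

```latex
\[
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
\]
```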
To determine the optimal number of clusters, researchers can use methods such as the Silhouette method or the Elbow method, or fix the number of clusters in advance. This study uses the Elbow method, which determines the best number of clusters by comparing the results across numbers of clusters and locating the point at which the curve forms an angle (the elbow).
Several studies have used the K-means method. Reference [11] identified homogeneous areas of groundwater quality. Reference [12] used K-means to classify the water quality status of rivers in Banjarmasin, Indonesia. Reference [13] grouped Balinese handicraft products using the K-Medoids algorithm. Clustering was used in [14] to analyze cyberbullying on Instagram and in [15] to classify store sales data. Reference [16] identified the availability of health human resources in Central Java. In addition, [17] studied the performance of PDAMs in providing water quality, based on the features healthy and unhealthy, unhealthy and sick, and healthy and sick. In [18], the K-means algorithm was used to classify poverty in Papua. Reference [19] also used the K-Medoids method.
Based on the problems discussed, this research applies the K-means clustering method to group rivers in DIY based on water quality parameters. The results are expected to form river groups and to show which rivers share the same characteristics in terms of the water quality parameters used and the potential pollutants present.

II. METHOD
The data in this study are secondary data from the Yogyakarta Special Region (DIY) Environment and Forestry Service, taken from the 2020 DIY Environmental Quality Index book [20]. The data consist of 56 river samples observed in November 2020. The variables are Total Suspended Solid (TSS), Dissolved Oxygen (DO), Biological Oxygen Demand (BOD), Chemical Oxygen Demand (COD), Phosphate, Fecal Coli, and Total Coliform.
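The research steps in using the K-means clustering method are as follows.
1. Prepare the river water quality data.
2. Perform assumption tests for K-means clustering. These tests include a multicollinearity test and an outlier detection test. The multicollinearity test examines the correlation between variables, where the correlation value should not exceed 0.95.
4. Perform the K-means clustering algorithm:
a. Determine the number of clusters (k) with the Elbow method. This study uses k = 2, 3, 4, 5, 6, 7, and 8.
b. Determine the centroid value (cluster center point).
c. Calculate the distance from each centroid to each object.
d. Assign the data to the cluster with the closest distance.
e. Calculate the new cluster center as the average value of the data that are members of the cluster.
f. Repeat steps b to e until no data are moved to another cluster.
5. Compare the clustering results at k = 2, 3, 4, 5, 6, 7, and 8 using the standard deviation ratio, the value of the F test statistic, and the value of the silhouette index coefficient.
6. Conduct cluster member profiling based on the average value.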
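The centroid, distance, assignment, and update steps listed for the K-means algorithm can be sketched as follows. This is a minimal illustration only: the function name `kmeans`, the random choice of initial centroids, the Euclidean distance, and the toy data are assumptions, not the software or settings used in this study.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd-style K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step b: pick k distinct observations as the initial centroids (assumed).
    centroids = list(rng.sample(points, k))
    labels = None
    for _ in range(max_iter):
        # Steps c-d: assign every point to its nearest centroid (Euclidean).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step f: stop once no observation moves to another cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step e: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy data made up for illustration (not the DIY river measurements).
toy = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(toy, k=2)
```

In practice the seven variables differ greatly in scale (Total Coliform is in the tens of thousands, Phosphate below 1), so some form of scaling before computing distances is usually considered; the source does not state whether this was done.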
The steps in K-means analysis are the assumption test, cluster formation, and cluster validation and profiling. Fig. 1 shows the steps in forming a cluster.
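The multicollinearity part of the assumption test, which the paper describes as checking that no correlation between variables exceeds 0.95, can be sketched as below. The helper names, the use of the absolute Pearson correlation, and the example values are assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def collinear_pairs(columns, threshold=0.95):
    """Return variable pairs whose absolute correlation exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# Made-up values for three columns (not the study's data).
columns = {
    "TSS": [10.0, 20.0, 30.0, 40.0, 50.0],
    "COD": [11.0, 19.0, 31.0, 42.0, 49.0],  # nearly proportional to TSS
    "DO": [8.0, 6.5, 9.0, 7.0, 8.5],
}
flagged = collinear_pairs(columns)
```

An empty result would indicate that no pair of variables breaches the 0.95 limit.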

A. Data Description
Descriptive analysis is used to describe the characteristics of the river water quality data in the 56 samples for each water quality variable. The TSS variable has a minimum value of 0.02, a maximum value of 147.00, an average of 24.95, and a standard deviation of 29.38. The quality standard is 50, and 5 samples exceed it. The DO variable has an average of 8.84, a minimum value of 4.42, a maximum value of 63.10, and a standard deviation of 7.52. The quality standard is 4, and 23 samples exceed it.
The BOD5 variable has an average of 4.33, a minimum value of 0.30, a maximum value of 21.30, and a standard deviation of 2.94. The quality standard is 3, and 21 samples exceed it. The COD variable has an average of 20.36, a minimum value of 1.40, a maximum value of 72.00, and a standard deviation of 11.43. The quality standard is 25, and 3 samples exceed it. The Total Coliform variable has an average of 59,210, a minimum value of 90.00, a maximum value of 920,000, and a standard deviation of 140,598. The quality standard is 25, and 9 samples exceed it (Table II).
Table II shows the number of samples by water quality status. The fulfilled status has a frequency of 3, the mild status 31, the moderate status 18, and the severe status 4, for a total of 56 samples.
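The per-variable summaries reported in this subsection (minimum, maximum, average, standard deviation, and the number of samples above the quality standard) can be computed along the following lines. The function name, the use of the sample standard deviation, and the example readings are assumptions; only the TSS quality standard of 50 is taken from the text.

```python
import statistics


def describe(values, standard):
    """Summarise one variable and count samples above its quality standard."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1), assumed
        "n_exceed": sum(v > standard for v in values),
    }


# Hypothetical TSS readings; only the threshold of 50 comes from the paper.
tss = [12.0, 48.0, 55.0, 20.0, 65.0]
summary = describe(tss, standard=50)
```

Counting values above the threshold is an assumption that suits upper-limit parameters such as TSS; for a parameter whose standard is a lower limit, the comparison would need to be reversed.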

B. K-means Clustering Algorithm
In K-means clustering, the first step is to determine the optimal number of clusters using the Elbow method. This commonly used method determines the optimal number of clusters by comparing the results across numbers of clusters and locating the point at which the curve forms an angle. The results of the Elbow method are presented in Fig. 2. The line bends into an elbow at k = 6, so the Elbow method gives 6 as the best number of clusters. However, to compare the numbers of clusters, calculations were performed for k = 2, 3, 4, 5, 6, 7, and 8.
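An elbow curve of the kind described here is typically built by running K-means for each candidate k and recording the within-cluster sum of squares. The sketch below assumes that quantity, a small random-restart scheme, and made-up data; the paper does not state which criterion or software produced Fig. 2.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Minimal Lloyd-style K-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))
    labels = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def wcss(points, labels, centroids):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def elbow_curve(points, ks, restarts=10):
    """Lowest WCSS found for each k over several random restarts."""
    curve = {}
    for k in ks:
        curve[k] = min(wcss(points, *kmeans(points, k, seed=s)) for s in range(restarts))
    return curve


# Made-up 2-D data with three visible groups (not the river data).
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
curve = elbow_curve(toy, ks=range(1, 6))
```

Plotting `curve` against k and looking for the point where the decrease flattens gives the elbow; reading that point is a visual judgment, which is one reason for also comparing several values of k.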
After obtaining the optimal number of clusters, clustering was performed with the K-means algorithm. The number of members in each cluster is presented in Table III.
Fig. 3 shows the visualization of the grouping results for each number of clusters (k). The visualization uses two dimensions. Dimension 1 explains 31.5% of the variation and dimension 2 explains 24%. Good clustering is indicated by high homogeneity between observations within a cluster and high heterogeneity between clusters. High homogeneity within a cluster is shown by observations that lie close together, while high heterogeneity between clusters is shown by a large distance between clusters. Comparing the visualizations, the groupings with k = 2, 3, and 7 are better than those with k = 4, 5, 6, and 8. As an illustration, for k = 3 the observations in cluster 1 are close together and shown in red, the observations in cluster 2 are close together and shown in green, and the observations in cluster 3 are close together and shown in blue. These three groups are also far apart, and their colors do not overlap.
After obtaining the clustering results, a comparison was made to find the best grouping. The comparison uses the standard deviation ratio and the value of the F test statistic from MANOVA. In addition, validation was carried out to determine whether the resulting clusters are valid to use, based on the value of the silhouette index coefficient. The standard deviation ratio, the value of the F test statistic, and the value of the silhouette index coefficient are presented in Table IV.
c. Silhouette Index Coefficient
The silhouette coefficient method combines the cohesion and separation methods. Separation measures how far a cluster is from other clusters. Cohesion measures how closely related the objects within a cluster are. The closer the silhouette index is to 1, the better (more valid) the grouping. Based on the comparison, the greatest silhouette index value is 0.29, obtained with 2 and 4 clusters. A visualization of the comparison results is shown in Fig. 4.
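The cohesion and separation terms described above combine into the usual silhouette width, s = (b - a) / max(a, b), averaged over all observations. The sketch below is a minimal illustration that assumes Euclidean distance, at least two clusters, and a score of 0 for a cluster with a single member; the data are made up.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette width; a point alone in its cluster scores 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # Cohesion a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Separation b: smallest mean distance to the members of another cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up points: two tight pairs that are far apart.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(pts, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_score(pts, [0, 1, 0, 1])   # pairs split across clusters
```

A grouping that keeps nearby points together scores close to 1, while one that splits them scores lower or negative, which is the sense in which values nearer 1 indicate a more valid clustering.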
This study selects the grouping with k = 6 as the best. Therefore, profiling was carried out at k = 6. Profiling describes the characteristics of each variable in each cluster and is based on the average value. Table V shows the profiling of each variable in each cluster with k = 6.
The cluster that has the highest average TSS, BOD, and COD values is the 3rd cluster. This cluster consists of the Gajahwong River in Bantul Regency and the Content River in Sleman and Bantul Regencies. The cluster that has the highest average DO value is the 6th cluster, namely the Gajahwong River in Bantul Regency. The cluster that has the highest average Phosphate value is the 2nd cluster. This cluster includes the Winongo River, Gajahwong River, Code River, and Bedog River in Bantul Regency, the Belik River in Sleman Regency, and the Oyo River in Gunungkidul Regency and Bantul Regency. The cluster that has the highest average values of Fecal Coli and Total Coliform is the 4th cluster. This cluster includes the Winongo River and Bedog River in Bantul Regency, the Code River in Yogyakarta City and Bantul Regency, the Kuning River, the Belik River, and the Bulus River in Sleman Regency, as well as the Gajahwong River in Sleman, Yogyakarta, and Bantul Regency.
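Profiling by average value, as used above to find the cluster with the highest mean for each variable, amounts to a grouped mean followed by a maximum. The sketch below uses hypothetical rows, labels, and helper names purely to illustrate the calculation; it does not reproduce Table V.

```python
from collections import defaultdict


def profile(rows, labels, variables):
    """Average of every variable within each cluster."""
    grouped = defaultdict(list)
    for row, lab in zip(rows, labels):
        grouped[lab].append(row)
    return {
        lab: {v: sum(r[v] for r in members) / len(members) for v in variables}
        for lab, members in sorted(grouped.items())
    }


def highest(profiles, variable):
    """Cluster label with the largest average for one variable."""
    return max(profiles, key=lambda lab: profiles[lab][variable])


# Hypothetical rows and cluster labels; the numbers are illustrative only.
rows = [
    {"TSS": 10.0, "DO": 9.0},
    {"TSS": 14.0, "DO": 7.0},
    {"TSS": 90.0, "DO": 5.0},
]
labels = [1, 1, 2]
profiles = profile(rows, labels, ["TSS", "DO"])
```

Calling `highest(profiles, "TSS")` then returns the label of the cluster with the largest average TSS, mirroring the statements made for each variable in this section.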

IV. CONCLUSION
The data description shows that many river locations still have levels above the quality standard for TSS, DO, BOD, COD, Phosphate, Fecal Coli, and Total Coliform. In the K-means clustering analysis with k = 6, the cluster with the highest average TSS, BOD, and COD values is the 3rd cluster (rivers in Bantul and Sleman Regencies). The cluster with the highest average DO value is the 6th cluster (a river in Bantul Regency). The cluster with the highest average Phosphate value is the 2nd cluster (rivers in Bantul, Sleman, and Gunungkidul Regencies). The cluster with the highest average Fecal Coli and Total Coliform values is the 4th cluster (rivers in Bantul Regency, Yogyakarta City, and Sleman Regency). The number of members in each cluster is n1 = 14, n2 = 1, n3 = 1, n4 = 5, n5 = 18, and n6 = 17.

4. Perform the K-means clustering algorithm:
a. Determine the number of clusters (k) with the Elbow method. This study uses k = 2, 3, 4, 5, 6, 7, and 8.
b. Determine the centroid value (cluster center point).
c. Calculate the distance from each centroid to each object.
d. Assign the data to the cluster with the closest distance.
e. Calculate the new cluster center as the average value of the data that are members of the cluster.

ACKNOWLEDGEMENT
We would like to thank the Department of Environment and Forestry of the Special Region of Yogyakarta for granting research permits to access the data; Institut Sains & Teknologi AKPRIND Yogyakarta for funding this research; and the Doctoral Study Program of Environmental Science, School of Postgraduate Studies, Diponegoro University, for the excellent support and cooperation in producing the work described in this paper.

Fig. 4. Silhouette cluster plots with k = 2, 3, 4, 5, 6, 7, and 8.
1. Prepare the river water quality data.
2. Perform assumption tests for K-means clustering. These tests include a multicollinearity test and an outlier detection test. The multicollinearity test examines the correlation between variables, where the correlation value should not exceed 0.95. Meanwhile, the detection of outlier data uses a