BCBimax Biclustering Algorithm with Mixed-Type Data

The application of biclustering analysis to mixed-type data is still relatively new. Initially, biclustering was used primarily on gene expression data, which are measured on an interval scale. In this research, ordinal categorical variables are transformed to an interval scale using the Method of Successive Intervals (MSI). The BCBimax algorithm is applied with several binarization experiments, and the binarization that produces the smallest Mean Square Residue (MSR) at predetermined row and column thresholds is retained. A row and column threshold test is then carried out to find the optimal bicluster threshold. Because different parties prioritize different international market potential variables, and because Indonesia exports to many destination countries, destination countries need to be mapped according to their international trade potential. With the median of all data as the binarization threshold, the optimal MSR is found at a row threshold of 7 and a column threshold of 2. Nine biclusters are formed, covering 74.7% of the countries. Most countries in the biclusters are from the European continent, and a few are from the African continent.


I. INTRODUCTION
Real-world data often consist not only of numeric or only of categorical variables but of a mixture of both (mixed-type data). Choosing an approach to handle mixed data remains a challenge, especially for biclustering methods, because the choice depends heavily on the dataset. There are several approaches to mixed data analysis. One is to convert all variables into categorical variables and then proceed with clustering analysis [1]. Another is to quantify categorical variables into numerical variables using a dimension reduction technique, followed by a clustering technique [2].
Cluster analysis is a method of grouping objects based on their similar characteristics [3]. The most commonly used form is one-way clustering. This analysis assumes that objects have similar characteristics across all rows or all columns, so objects in rows are grouped based on similarity across columns, or variables in columns are grouped based on similarity across rows. One-way clustering has been carried out separately, namely by grouping objects in rows using a distance matrix and then grouping variables in columns using a correlation matrix [4]. Such clustering is limited for two-way data when the goal is to relate a particular group of objects to a specific group of variables simultaneously. Biclustering is a development of cluster analysis that groups data simultaneously in two directions or dimensions. The two-way clustering technique (biclustering) was then applied to the gene expression matrix, a matrix of real numbers that show the activity of different genes (rows) under different experimental conditions (columns) [5].
Biclustering algorithms are classified into five categories: greedy iterative search, divide and conquer, exhaustive bicluster enumeration, iterative row and column clustering combination, and distribution parameter identification [6]. In this study, the chosen algorithm is BCBimax, which belongs to the divide-and-conquer category. The BCBimax algorithm was introduced for market segmentation research using customer pain points [7]. It is a development of the Bimax algorithm that avoids overlapping and also addresses the tendency of Bimax to produce too many small biclusters [8]. This method is fast and precise in producing optimal biclusters, so it has become a reference for other algorithms [9]. The BCBimax algorithm has been widely applied to identify customer segments based on behavior [10] and has been applied in the health sector to identify subgroups of lymphoma cancer survivors [8].
Applying biclustering analysis to mixed data is still relatively new because this analysis was initially used mostly on gene expression data, which are measured on an interval scale. In addition, no algorithm has so far been developed that can accommodate mixed data directly. Therefore, in this study, ordinal categorical variables are first transformed to an interval scale. One widely used method for this transformation is the Method of Successive Intervals (MSI). The MSI transformation converts ordinal data into interval data by changing the cumulative proportion of each category of a variable to its standard normal curve value [11].
The open economic system and globalization make international trade even more important because every country uses it to analyze economic development and formulate economic policies [12]. Determining international trade potential requires studying many factors, such as cultural, economic, and demographic factors, which consist of mixed data types, namely numerical and categorical data. In determining market potential, each company also has different interests when prioritizing variables. These differing interests in the international market potential variables, together with the large number of Indonesia's export destination countries, create the need to map destination countries by international trade potential so that exports can be targeted appropriately and efficiently. Several studies on grouping export destination countries have been carried out, such as grouping Turkey's export destination countries based on an assessment of market potential using factor analysis and K-Means [13] and grouping potential markets in developing countries using Markov chains [14].
Based on this discussion, this research aims to carry out a biclustering analysis using the BCBimax algorithm on international trade potential data covering 103 countries. The results are expected to serve as a reference for government policymaking on increasing exports to potential countries, based on the biclusters formed.

A. Data
This study uses 15 variables: 10 numerical variables and 5 ordinal categorical variables. Variables X1-X9 and X14-X15 are extracted from World Development Integration Data by the World Bank, variable X10 from the Global Sustainable Competitiveness Index (GSCI) [15], variable X11 is calculated using the formulation in [16], [17], variable X12 comes from Amphori [18], and variable X13 from FM Global [19]. The research data were obtained from a total of 103 countries, as presented in Table I. R-Studio software is used as the data analysis tool.

B. Research Stages
This research has three stages: pre-processing, inverse and data transformation, and biclustering analysis. A complete explanation follows.
1) Pre-processing: At this stage, the data are cleaned by deleting records that contain missing values. After the complete data for 103 countries are obtained, the numerical variables are standardized. The standardized data are then examined with exploratory analysis to see their initial characteristics.
2) Inverse and data transformation: The variables X2, X5, and X11 are multiplied by -1 so that their direction is in line with the other variables according to the definition of international trade potential. In addition, the ordinal variables X13, X14, and X15 are transformed into interval data using MSI. The stages of the MSI transformation are as follows [11]:
1. Calculate the frequency of observations for each category.
2. Calculate the proportion in each category.
3. From the proportions obtained, calculate the cumulative proportion for each category.
4. Calculate the Z value (standard normal distribution) of each cumulative proportion.
5. Determine the Z limit value (the value of the probability density function at the abscissa Z) for each category by (1):
$$f(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^{2}}{2}\right), \quad -\infty < z < +\infty \tag{1}$$
6. Calculate the scale value for each category by (2):
$$\text{Scale value} = \frac{\text{density at lower limit} - \text{density at upper limit}}{\text{area under upper limit} - \text{area under lower limit}} \tag{2}$$
7. Calculate the score (transformed value) for each category through (3).
3) BCBimax Algorithm: The idea behind the BCBimax algorithm is to partition the binary data into three submatrices, one of which contains only 0 elements and can therefore be discarded. The algorithm is then applied recursively to the remaining two submatrices, U and V, which are produced in step 3 of the procedure in Fig. 1. The recursion ends when a submatrix containing only 1 elements is formed. To avoid overlapping, the next bicluster search is run on the submatrix that excludes the rows of the previous bicluster. The stages of the algorithm are shown in Fig. 1.

4) Optimal bicluster selection: The evaluation function is an indicator of the performance of a biclustering algorithm. To choose the optimal bicluster from several threshold trials, a comparison value is needed. In this study, the Mean Square Residue (MSR) evaluation function is used as the basis for selecting the optimal bicluster and is defined in (5). Let $a_{iJ}$ denote the average of the $i$-th row of bicluster $(I, J)$, $a_{Ij}$ the average of the $j$-th column of $(I, J)$, and $a_{IJ}$ the average of all elements in the bicluster. Based on [5], these can be written as (4):
$$a_{iJ} = \frac{1}{|J|}\sum_{j \in J} a_{ij}, \quad a_{Ij} = \frac{1}{|I|}\sum_{i \in I} a_{ij}, \quad a_{IJ} = \frac{1}{|I||J|}\sum_{i \in I,\, j \in J} a_{ij} \tag{4}$$
The residue of element $a_{ij}$ in the submatrix $A_{IJ}$ is $r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$. The quality of the bicluster can be evaluated by calculating $H(I, J)$, the sum of the squared residues of all its elements divided by the volume. This quantity, hereinafter referred to as MSR, is given in (5):
$$H(I, J) = \frac{1}{|I| \times |J|}\sum_{i \in I,\, j \in J} r_{ij}^{2} \tag{5}$$
where $|I| \times |J|$ is the dimension of the bicluster (volume), i.e., $|I|$ is the number of bicluster rows and $|J|$ is the number of bicluster columns. The quality of a bicluster improves as the residue decreases and/or the volume of the bicluster increases. The quality of a group of biclusters can then be measured by the average of MSR divided by volume, or the average MSR per volume, defined by (6) [20]:
$$\overline{\mathrm{MSR}}_{\mathrm{vol}} = \frac{1}{b}\sum_{k=1}^{b}\frac{H(I_k, J_k)}{|I_k| \times |J_k|} \tag{6}$$
where $b$ is the number of biclusters generated by a given algorithm.
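As a rough illustration of the evaluation function described above, the following Python sketch computes the MSR of a single bicluster and the average MSR per volume over a set of biclusters. The function names and the list-of-lists representation of a bicluster are assumptions made for this example and are not taken from the study, which used R-Studio.

```python
def mean_square_residue(sub):
    """MSR of one bicluster given as a list of rows (assumed layout)."""
    n_rows, n_cols = len(sub), len(sub[0])
    # Row means, column means, and the overall mean of the bicluster.
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    # Sum of squared residues: a_ij - row mean - column mean + overall mean.
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


def average_msr_per_volume(biclusters):
    """Average over biclusters of MSR divided by volume (rows x columns)."""
    ratios = [mean_square_residue(b) / (len(b) * len(b[0]))
              for b in biclusters]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    additive = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]  # perfectly additive pattern
    checker = [[1.0, 0.0], [0.0, 1.0]]             # no additive structure
    print(mean_square_residue(additive))
    print(mean_square_residue(checker))
    print(average_msr_per_volume([additive, checker]))
```

A perfectly additive submatrix has zero residue everywhere, which is why smaller MSR values indicate more coherent biclusters.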

A. Data Exploration
The initial characteristics of each variable can be seen from the boxplot in Fig. 2(a). The boxplot shows that almost all variables have extreme values; the exceptions are X4, X10, and X11, which contain no outliers. This indicates that no country has an extreme urban population percentage, competitiveness index, or cultural similarity index. Most of the medians tend to be negative (below zero), which indicates that, for most variables, countries tend to have moderate to low values.
Fig. 2(b) shows the frequencies of the four categorical variables X12 to X15. For these four variables, the frequencies of the categories do not differ much, except for X15. For X15, countries with code 1, namely developing countries, dominate compared with transitional countries and developed countries.

B. Ordinal Data Transformation
Transformation using MSI changes the data scale from ordinal to interval. This is done because biclustering analysis can be applied only to data measured on a single scale, either all interval or all ordinal.
The results of the MSI transformation are shown in Table II. The variables X13, X14, and X15 are ordinal, so they are transformed using MSI, while X12 is not transformed because it is a binary variable. Variable X13 has 4 categories with almost the same proportion of samples in each category, namely in the range of 0.2. The same holds for X14, where the proportions of samples in the three categories are not much different. In X15, the proportions are unequal: the majority of countries are developing countries (category 1), followed by a few transitional countries (category 2). For the biclustering analysis, the values of X13, X14, and X15 are replaced with the values in the Interval column, which are the results of the transformation.
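The MSI steps listed in the research stages can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `msi_scores` is hypothetical, categories are assumed to sort in their ordinal order, and the final shift that sets the smallest score to 1 is a commonly used convention adopted here because the scoring formula (3) is not reproduced in the text. The standard normal quantile and density come from `statistics.NormalDist` in the Python standard library.

```python
from collections import Counter
from statistics import NormalDist


def msi_scores(values):
    """Map each ordinal category to an interval-scale score (MSI sketch)."""
    std_normal = NormalDist()          # standard normal: mean 0, sd 1
    counts = Counter(values)           # step 1: frequency per category
    categories = sorted(counts)        # assumed to sort in ordinal order
    n = len(values)

    scale = {}
    cum_lower, dens_lower = 0.0, 0.0   # below the first category: z = -inf
    for idx, cat in enumerate(categories):
        proportion = counts[cat] / n               # step 2
        cum_upper = cum_lower + proportion         # step 3
        if idx == len(categories) - 1:
            dens_upper = 0.0                       # z = +inf for the last one
        else:
            z = std_normal.inv_cdf(cum_upper)      # step 4
            dens_upper = std_normal.pdf(z)         # step 5, equation (1)
        # Step 6, equation (2): density difference over area difference.
        scale[cat] = (dens_lower - dens_upper) / (cum_upper - cum_lower)
        cum_lower, dens_lower = cum_upper, dens_upper

    # Step 7 (assumption): shift so that the smallest score equals 1.
    shift = 1.0 - min(scale.values())
    return {cat: scale[cat] + shift for cat in categories}


if __name__ == "__main__":
    sample = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4]   # made-up ordinal data
    for category, score in msi_scores(sample).items():
        print(category, round(score, 3))
```

Each observation would then be replaced by the score of its category, which corresponds to the Interval column described above.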

C. Biclustering Results
The BCBimax algorithm requires a binarization process before the biclustering analysis is carried out. In this study, binarization is carried out in 4 trials, using as the threshold: 1) the system threshold value, 2) the median of all data, 3) the median of each variable, and 4) the average of all data (which has the same value as the average of each variable). The purpose of the trials is to compare the average MSR/volume of the biclusters formed. The comparison is made at the smallest minimum number of rows and columns, namely minimum row = 2 and minimum column = 2.
Table III compares the four threshold scenarios in the binarization of the scaled data matrix. In the first scenario, the system threshold produces a binary matrix in which the proportion of 0 elements is 0.976. This proportion is very large and unbalanced, so no bicluster is formed. In scenarios 2 to 4, the proportion is fairly balanced, close to 0.5, and a large number of biclusters is produced, so the binarization in these scenarios can be assumed to provide sufficient information. The scenario with the smallest average MSR/volume is the one based on the median of all data. The median of all data is therefore used as the binarization threshold for the biclustering analysis with the BCBimax algorithm.
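To make the binarization step concrete, the sketch below thresholds a small data matrix at the median of all its values and reports the proportion of 0 elements, the quantity compared across the scenarios above. The function name `binarize_at_median` and the toy matrix are assumptions for illustration. Elements strictly greater than the threshold are coded 1, consistent with the description that element "1" marks values greater than the median.

```python
from statistics import median


def binarize_at_median(matrix, threshold=None):
    """Return (binary matrix, proportion of zeros) for a list-of-rows matrix."""
    flat = [x for row in matrix for x in row]
    if threshold is None:
        threshold = median(flat)        # scenario: median of all data
    # 1 when the value exceeds the threshold, 0 otherwise.
    binary = [[1 if x > threshold else 0 for x in row] for row in matrix]
    zeros = sum(1 for row in binary for b in row if b == 0)
    return binary, zeros / len(flat)


if __name__ == "__main__":
    toy = [[-1.0, 0.5],
           [0.2, -0.3]]                 # made-up standardized values
    binary, zero_share = binarize_at_median(toy)
    print(binary)
    print(zero_share)
```

Passing an explicit `threshold` would allow the other scenarios, such as the overall mean, to be compared in the same way.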
The BCBimax algorithm works by clustering all submatrices whose elements have a value of 1. In this study, 60 combinations of minimum row and column thresholds are tested. The minimum column threshold takes the values 2, 3, and so on up to 7, where the highest value, 7, is half of the total number of research variables. The minimum row thresholds generally follow the minimum column thresholds, namely 2, 3, and so on up to 7. For further exploration, minimum row values of 10, 13, 15, and 20 are also tried, giving 60 threshold combinations in total. The highest number of biclusters, 41, results from the combination of the smallest thresholds, namely a minimum row and column of 2. The smaller the thresholds, the more biclusters are formed; the larger the thresholds, the fewer.
[Table III, row for the median of all data (-0.098): 0.487, 41, 0.006]
The average MSR per volume is presented in Fig. 3 and is read using the elbow method, which is commonly used in k-means clustering analysis, where the optimal number of clusters lies at the elbow of the curve. Here, a pair of row and column thresholds is considered informative if it has a small average MSR per volume and forms an angle at a point on the curve. The average MSR per volume decreases sharply between two threshold points and then stays relatively constant from the minimum threshold of row 7 and column 2, where the average MSR per volume is 0.0036. This threshold forms a total of 9 biclusters. Thus, the optimal bicluster group from the BCBimax algorithm is at the minimum row threshold of 7 and the minimum column threshold of 2.
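The grid of 60 threshold combinations described above can be enumerated as in the following sketch. The variable names are illustrative, and the integer division reflects the statement that the highest column threshold, 7, is half of the 15 research variables.

```python
from itertools import product

N_VARIABLES = 15                       # number of research variables
max_threshold = N_VARIABLES // 2       # half of 15, rounded down

# Minimum column thresholds: 2 up to 7.
col_thresholds = list(range(2, max_threshold + 1))
# Minimum row thresholds: 2 up to 7, plus the extra exploratory values.
row_thresholds = list(range(2, max_threshold + 1)) + [10, 13, 15, 20]

# Every (minimum rows, minimum columns) pair to be tried.
grid = list(product(row_thresholds, col_thresholds))

if __name__ == "__main__":
    print(len(row_thresholds), len(col_thresholds), len(grid))
```

Ten row thresholds times six column thresholds give the 60 combinations, and the selected pair (7, 2) is one of them.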
The characteristics of the optimal bicluster results are presented in Table IV. The table shows that the BCBimax algorithm with the row 7 and column 2 thresholds groups 77 countries, or 74.7%, into the 9 biclusters formed. No country appears in more than one bicluster; in other words, the countries in each bicluster are unique. Meanwhile, the 26 countries, or 25.3% of the countries in the study, that were not included in any bicluster had a negative average value after standardization or an average value smaller than the mean of the entire scaled data. This is consistent with how the BCBimax algorithm works: it only finds elements equal to "1", which in this study are the elements with a value greater than the median. Another reason these countries are not included in a bicluster is that they do not share the same characteristics of international trade potential with other countries. As for the membership of the variables, several variables appear in more than one bicluster. This overlap in variables is not a significant problem, considering that a country can have several indicators of international trade potential.
The mapping of international trade potential in Indonesia's export destination countries is shown on the map in Fig. 4. In bicluster 1 (orange), the majority of members are on the European continent, namely Belgium, Estonia, France, Luxembourg, Malta, and the United Kingdom, together with the United States on the American continent. Bicluster 2 (red) is also dominated by countries from Europe, namely Denmark, Finland, Germany, Greece, Hungary, Italy, Latvia, Lithuania, Norway, and Sweden. It also contains Japan, Australia, and New Zealand. This second bicluster has the most members of all the biclusters. Biclusters 3 and 4 are likewise dominated by European countries, given that European countries also dominate the sample in this study. Most members of biclusters 5 and 9 come from the Americas. Most countries from the Asian continent are in biclusters 6 and 8, while bicluster 7 is dominated by countries from Africa.
To show the similarity of trade potential, a membership plot is displayed in Fig. 5. Each row in the plot represents a variable, and each column represents a bicluster. The green rectangles mark trade potential variables that are similar and are therefore grouped into one bicluster. The brighter the middle color, the more countries have high international trade potential. Here, a bright color means the lighter version of the color of the box at the side. The X7 and X14 variables have the brightest color. In total, 69.9% of countries have a high proportion of individual internet users and 67.9% of countries have high state income, so these two variables appear most often in the biclusters formed, namely in biclusters 1 to 6. The same applies to X2 and X5: 66.9% of countries have low import tariffs and 64.1% of countries are close to Indonesia. These two variables are also among those that appear most often in the biclusters. Variables X8 and X15 have a dark color: only 34.9% of countries have high per capita GDP, and 43.7% of countries fall into the high economic development categories.

IV. CONCLUSION
This study applies the BCBimax algorithm to international trade potential data for 103 countries. Based on an experiment with 60 row and column threshold combinations, the BCBimax algorithm produces an optimal bicluster at the row and column combination (7, 2), with an average MSR/volume of 0.0036. In the binarization threshold experiment, the median of all data produces the most biclusters and an average MSR/volume that tends to be small compared with the other three experiments. All countries with high trade potential are well clustered and spread across 9 biclusters. The biclusters formed have unique country characteristics, and the majority of countries on the European continent have good trade potential compared with countries from other continents. Several countries in Africa also have fairly good trade potential, which the government can take into consideration when opening markets and establishing cooperation. The variables that appear most frequently in the nine biclusters are X2 (import tariffs), X7 (proportion of individual internet users), and X14 (country income category), because most values of these variables lie above the median. This means that, in this data, the majority of countries have low import tariffs, a high proportion of individual internet users, and upper-middle to high state income.

Fig. 2 Average value of MSR per volume

Fig. 3 Distribution map of optimal bicluster result by country without scale

TABLE II THE RESULTS OF THE TRANSFORMATION USING THE INTERVAL SUCCESSIVE METHOD