Non-linear Kernel Optimisation of Support Vector Machine Algorithm for Online Marketplace Sentiment Analysis



I. INTRODUCTION
Twitter is a powerful social media presence in the online marketplace, offering businesses a variety of opportunities to reach a broad audience and promote their products. Features such as tweets, retweets, and hashtags allow sellers to build brand awareness and interact with potential customers. Twitter also serves as a valuable source of real-time information, allowing sellers to monitor market trends and gather feedback from customers. However, using Twitter effectively comes with several challenges, such as reputation management and privacy regulations. Despite these challenges, Twitter remains a powerful tool to support the growth and success of online marketplaces [1]-[2].
Twitter aside, online marketplaces have changed the business landscape by providing a digital platform for sellers to reach a larger audience and compete with larger companies. These marketplaces also benefit consumers, offering easy access to products, greater choice, and the ability to compare prices and read reviews. However, they also bring challenges, such as intense competition and privacy concerns. Regulations in Indonesia, such as the ITE Law and Government Regulation No. 80/2019, aim to ensure fair and orderly trading activities through online marketplaces while protecting consumer rights.
In addition, the use of Support Vector Machines (SVM) in tweet data analysis has a significant impact on understanding public opinion, trends, and reactions on social media platforms such as Twitter. SVMs enable classification and sentiment analysis of tweet data, providing valuable insights for decision-making, brand management, and research purposes [3].
In previous research related to the marketplace, sentiment analysis of the Shopee application using the SVM method achieved an accuracy of 98% and an F1-score of 98% [4]. Analysis of Shopee marketplace sentiment using the Naïve Bayes classifier achieved an accuracy of 90.03% [5]. A comparison of NBC and SVM in the online marketplace found SVM's accuracy to be 5% higher than NBC's [6]. Research on Shopee product reviews obtained an accuracy of 90.03% using Naïve Bayes [7]. Sentiment analysis on Twitter social media towards Shopee e-commerce through the SVM method achieved a kernel SVM accuracy of 93.20% [8]. Subsequent research analyzed fake e-commerce reviews, with the best accuracy achieved by the SVM model [9]. Further research on B2C e-commerce customer churn prediction based on K-Means and SVM likewise found the SVM method to perform best [10]. This research compares the performance of non-linear kernels in finding the highest accuracy in a classification task. It builds on previous research in which SVM achieved better accuracy when analyzing balanced classes. The aim of this study is to determine the best accuracy of the SVM algorithm in a case study of online marketplace sentiment analysis.

II. METHOD
Research processes that involve collecting and analyzing data from Twitter usually involve several vital stages, including data crawling, preprocessing, labelling, data splitting, and the use of an SVM [11]. The flow of this research is shown in Fig. 1.

A. Crawling Data
The research flow begins with crawling data from Twitter. In this stage, tweet data is collected from Twitter using web scraping techniques or the Twitter API [12]. Data collection must be done carefully to ensure the quality and accuracy of the data to be used in the analysis.

B. Data Preprocessing
Once the tweet data has been collected, the next step is data preprocessing. This process involves cleaning and preparing the data for further analysis. It includes removing special characters, addressing missing or duplicate data, and converting the tweet text into a format that can be used by modelling algorithms such as SVM [13]-[14]. Preprocessing can also involve text normalization and the removal of irrelevant words. The preprocessing flow can be seen in Fig. 2.
There are five major stages in preprocessing to obtain data with a good level of accuracy:
1) Case folding: converting all text into lowercase letters, which helps prevent errors or mismatches that may arise from differences in letter case.
2) Text cleaning: removing symbols, punctuation, extra spaces, and other characters that are irrelevant or constitute noise.
3) Tokenization: splitting a sentence into words or tokens.
4) Stopword removal: removing words that are considered common in a text and often do not carry meaning or important information in a natural language analysis approach.
5) Stemming: removing affixes from words so that only the root or base form remains.
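The five stages above can be sketched as a small pipeline. This is a minimal illustration in pure Python, not the study's actual implementation: the stopword list is a tiny sample, and the stemmer is a crude suffix stripper standing in for a proper Indonesian stemmer (e.g. Sastrawi).

```python
import re
import string

# Small illustrative stopword list; a real pipeline would use a full
# Indonesian stopword list.
STOPWORDS = {"yang", "dan", "di", "ke", "the", "a", "is", "to"}

def case_folding(text):
    # 1) Convert all characters to lowercase.
    return text.lower()

def clean_text(text):
    # 2) Strip URLs, mentions, hash signs, punctuation, digits, extra spaces.
    text = re.sub(r"http\S+|@\w+|#", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    # 3) Split the sentence into word tokens.
    return text.split()

def remove_stopwords(tokens):
    # 4) Drop common words that carry little meaning.
    return [t for t in tokens if t not in STOPWORDS]

def stem(tokens):
    # 5) Crude suffix stripping; only a stand-in for a real stemmer.
    return [re.sub(r"(nya|kan|an)$", "", t) for t in tokens]

def preprocess(text):
    return stem(remove_stopwords(tokenize(clean_text(case_folding(text)))))
```

For example, `preprocess("Belanja di Shopee MURAH dan cepat! http://x.co")` yields the cleaned tokens `["belanja", "shopee", "murah", "cepat"]`.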

C. Labeling
Next, the tweet data needs to be labelled. This process usually involves attributing a specific label or category to each tweet. For example, in sentiment analysis, tweets can be labelled as positive, negative, or neutral based on their content. This labelling is important for training the SVM model so that it classifies the tweets correctly [15]-[16].
In this study, the dataset derived from Twitter consists of reviews or comments. Those reflecting a favourable point of view, opinion, or assessment of a particular subject or topic are labelled as positive (value 1), while those reflecting an adverse or critical point of view, opinion, or assessment are labelled as negative (value 0).
The equation for determining labeling can be seen from (1).
The flow of the labelling process can be seen in Fig. 3. Tweet counters work with the help of tables that are customized to the research topic. Table I outlines the criteria used to determine the labels in the document.

D. Term weighting
Term weighting, primarily through the TF-IDF (Term Frequency-Inverse Document Frequency) method, is a technique used in text processing and information retrieval to assign appropriate weights or values to words in a document or text corpus. This technique helps in identifying the extent to which words are relevant or necessary in a particular context [17]-[18]. Here is a more detailed explanation of TF-IDF and its formula.
1) Term Frequency (TF): a metric that measures how often a term t appears in a document d, relative to the total number of terms in that document, as in (2).

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
Using the TF-IDF technique, words that appear frequently in a document but also appear frequently throughout the corpus will have a lower weight. In comparison, words that appear infrequently in documents but are unique in the context of the corpus will have a higher weight. This helps in finding keywords or relevant terms in text analysis, information retrieval, and various other applications in text processing and data mining [19]-[21].
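The weighting scheme can be illustrated with a short sketch. This is a minimal pure-Python version of TF-IDF, not the study's implementation; the example corpus and word choices are hypothetical.

```python
import math
from collections import Counter

def tf(term, doc):
    # Term frequency: occurrences of the term divided by document length.
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log of total documents over the number
    # of documents that contain the term.
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    # TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
    return tf(term, doc) * idf(term, corpus)

# A word present in every document gets weight 0; a rarer word weighs more.
corpus = [["murah", "cepat"], ["murah", "lambat"], ["murah", "bagus"]]
print(tf_idf("murah", corpus[0], corpus))  # 0.0
print(tf_idf("cepat", corpus[0], corpus))  # 0.5 * ln(3), about 0.549
```

This makes the behaviour described above concrete: "murah" occurs in every document, so its IDF (and hence its weight) is zero, while "cepat" is unique to one document and receives a higher weight.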

E. Split Data
Data splitting involves dividing the dataset into two parts: training data and testing data. Training data is used to train the model, while testing data is used to test the performance of the trained model. This aims to prevent overfitting, where the model fits the training data too closely and struggles to adapt to new data. This step is crucial for assessing how well the model can apply knowledge from the training data to real situations [22]. In this study, three data split scenarios are used: 80% training and 20% testing data, 50% training and 50% testing data, and 20% training and 80% testing data. This was done to train the model and measure the level of accuracy it can achieve in various data-splitting contexts.
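The three scenarios can be sketched with a simple deterministic split. This illustration is an assumption about the mechanics, not the study's code (the paper does not state its splitting tool); note that an 80/20 split of the 1276 tweets yields 256 test tweets, consistent with the confusion-matrix counts reported later.

```python
import random

def split_data(data, train_ratio, seed=42):
    # Shuffle a copy deterministically, then slice into train/test parts.
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# The three scenarios used in the study, applied to 1276 tweets.
for ratio in (0.8, 0.5, 0.2):
    train, test = split_data(range(1276), ratio)
    print(f"{int(ratio * 100)}% train: {len(train)} train, {len(test)} test")
```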

F. Support Vector Machine
SVM is a machine learning algorithm used for classification and sentiment analysis, in this case to analyze tweet data. The model is trained using an already labelled training set and then used to classify unseen tweets in the test set. The results of using SVM can be sentiment analysis, classification of tweets into specific categories, or other tasks according to the research objectives [23]-[24]. The formulas for the non-linear SVM kernels are given in Table II.
The accuracy of the SVM kernel model is influenced by the hyperparameters, where x and y denote the two input vectors that will be projected into the higher-dimensional feature space. These hyperparameters have an important impact on how effectively the model can separate and accurately classify the data.
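The three non-linear kernels evaluated later (polynomial, RBF, and sigmoid) can be written directly from their standard formulas. The default parameter values below are illustrative placeholders, not the values from Table II or Table X.

```python
import math

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    # K(x, y) = (x . y + coef0) ** degree
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.01):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def sigmoid_kernel(x, y, gamma=0.01, coef0=0.0):
    # K(x, y) = tanh(gamma * (x . y) + coef0)
    return math.tanh(gamma * sum(a * b for a, b in zip(x, y)) + coef0)

# Identical vectors are maximally similar under the RBF kernel.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0
```

Each function maps a pair of input vectors to a similarity score, which is how the kernel implicitly projects the data into a higher-dimensional feature space without computing that projection explicitly.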

G. Evaluation
In research studies, the confusion matrix helps researchers measure the accuracy and effectiveness of the classification models they use. For example, in a medical study, the research may try to classify patients as positive or negative for a condition based on test results [25]-[26]. The confusion matrix helps measure the extent to which the model can identify patients who are actually positive (True Positive) or negative (True Negative), as well as the extent to which the model makes mistakes by classifying positive patients as negative (False Negative) or vice versa (False Positive). The confusion matrix is also used to calculate other evaluation metrics such as accuracy, precision, recall, and F1-score, all of which provide deeper insight into the performance of classification models. By using the confusion matrix, researchers can measure and make more accurate and reliable decisions on research results in various research fields [27]-[28]. An example of a confusion matrix is presented in Table III.

Actual \ Predicted: Positive | Negative
Positive: True Positive (TP) | False Negative (FN)
Negative: False Positive (FP) | True Negative (TN)

1) Recall: measures the success rate in identifying actual positive cases as positive (5).

Recall = TP / (TP + FN)
2) Accuracy: the proportion of correct predictions among all predictions, assessing how accurately the model performs when compared with other data (6).

III. RESULTS AND DISCUSSION
This section provides a detailed description of each step taken during the implementation of the research.

A. Preprocessing
Table IV describes the results of the preprocessing steps. The text in the Twitter dataset still contains characters, punctuation marks, and uppercase letters, which are then converted into uniform words and lowercase letters.
Table V describes the steps in converting text into a series of word tokens. The data starts as a sentence that has been converted into lowercase letters, and then the data is broken down into tokens that represent each word in the sentence. Then, Table VI describes the steps in removing punctuation and words that lack meaning.
Table VII describes the stages in natural language processing used to remove affixes or word endings from words in the text so as to leave only the basic form or base word [29]. Stopword removal results containing affixed words are then converted into base words. The primary purpose of this process is to achieve consistency in the structure of the text and make the next stage of analysis more manageable. This process is essential in text processing efforts to analyze the information.
This preprocessing process helps ensure that the data used in analysis or modelling is of good quality, resulting in more accurate and meaningful results.

B. Labeling
Labelling uses the Python library TextBlob. This library can determine whether a text has a positive or negative sentiment. The labelled dataset can be seen in Table VIII.
The total of 1276 tweets consists of 538 positively labelled tweets and 738 negatively labelled tweets. Positive is represented by the value 1, and negative by the value 0.
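The mapping from a sentiment score to the study's 1/0 labels can be sketched as follows. TextBlob reports a polarity score in [-1, 1]; treating exactly-zero polarity as negative is an assumption here, since the paper does not state a tie-breaking rule.

```python
def label_from_polarity(polarity):
    # Map a polarity score in [-1, 1] to the study's labels:
    # positive -> 1, negative -> 0. Zero polarity defaulting to
    # negative is an assumption, not stated in the paper.
    return 1 if polarity > 0 else 0

# In the actual pipeline the polarity would come from TextBlob, e.g.:
#   from textblob import TextBlob
#   polarity = TextBlob(tweet_text).sentiment.polarity
print(label_from_polarity(0.7))   # 1
print(label_from_polarity(-0.3))  # 0
```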

D. Support Vector Machine
SVM is used to separate data into two sentiment classes, positive and negative, using training data from text documents that have been labelled positive and negative. The model is then used to make sentiment predictions on new text that it has never seen.
Parameter grid search is a method for finding the best parameter settings for an algorithm or model by testing various combinations of potential parameter values [30]. The purpose is to find the combination of parameters that provides optimal model performance, measured by predefined evaluation metrics such as accuracy, precision, or recall [31]. Table X shows the parameter combinations used by researchers in the grid search process. For each kernel type (polynomial, RBF, sigmoid), the table lists the values tested for each parameter relevant to that kernel.
For example, for the polynomial kernel, researchers tested combinations of values for the Degree and Coef0 parameters, with values specified within a certain range. Similarly, for the RBF and sigmoid kernels, researchers tested combinations of values for the Gamma, C, and Coef0 parameters, also within a certain range.
This process is carried out by running the model with every possible combination of parameters and then measuring the model's performance using techniques such as cross-validation or other appropriate evaluation methods. By analyzing the results of each parameter combination, researchers can determine which combination produces the best model performance and is therefore optimal for the model in a given case. Table XI illustrates the accuracy results, including the best parameters and highest accuracy for each non-linear SVM model; the same results are presented as a bar graph in Fig. 4. The average accuracy across all non-linear SVM kernel parameter settings is around 89%, with the best performance at the 80% training and 20% testing data split on the RBF kernel.
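The exhaustive search described above can be sketched in a few lines. This is an illustration only: the parameter grid below is hypothetical (the real ranges are in Table X), and the toy scorer simply peaks at the paper's reported optimum (RBF, C=100, gamma=0.01) to stand in for cross-validated accuracy, which in practice would come from a tool such as scikit-learn's GridSearchCV.

```python
from itertools import product
import math

# Hypothetical parameter grid; the study's actual ranges are in Table X.
PARAM_GRID = {
    "rbf": {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]},
    "poly": {"degree": [2, 3, 4], "coef0": [0.0, 1.0]},
    "sigmoid": {"gamma": [0.001, 0.01, 0.1], "coef0": [0.0, 1.0]},
}

def grid_search(evaluate):
    # Try every parameter combination for every kernel, keep the best.
    best_kernel, best_params, best_score = None, None, float("-inf")
    for kernel, grid in PARAM_GRID.items():
        names = list(grid)
        for values in product(*(grid[n] for n in names)):
            params = dict(zip(names, values))
            score = evaluate(kernel, params)
            if score > best_score:
                best_kernel, best_params, best_score = kernel, params, score
    return best_kernel, best_params, best_score

def toy_evaluate(kernel, params):
    # Dummy scorer peaking at the paper's reported optimum; the real
    # scorer would be cross-validated accuracy of the fitted SVM.
    if kernel != "rbf":
        return 0.85
    return (0.9
            - 0.01 * abs(math.log10(params["C"]) - 2)
            - 0.01 * abs(math.log10(params["gamma"]) + 2))

print(grid_search(toy_evaluate))
```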

E. Evaluation
Evaluation of the kernel SVM models using a confusion matrix provides valuable information on the model's ability to correctly classify True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) cases [32]. By analyzing the performance results through the confusion matrix, researchers gain an in-depth understanding of the accuracy, precision, recall, and F1-score, which provide an overall assessment of how effective the model is and also identify possible areas for improvement. These metrics are calculated using Equations (5), (6), (7), and (8). The confusion matrix for the polynomial kernel model yields 90% accuracy, 88% precision, 96% recall, and 92% F1-score for negative sentiment detection. These results are summarized in Table XII.
The classification results of the model can be seen in Fig. 5, where the model successfully identified TP 140, TN 91, FP 19, and FN 6 out of the total tweet data analyzed. The confusion matrix for the RBF kernel model yields 90% accuracy, 88% precision, 95% recall, and 91% F1-score for negative sentiment detection. These results are summarized in Table XIII.
The classification results of the model can be seen in Fig. 6, where the model successfully identified TP 139, TN 91, FP 19, and FN 7 out of the total tweet data analyzed.
The confusion matrix for the sigmoid kernel model yields 89% accuracy, 88% precision, 95% recall, and 91% F1-score for negative sentiment detection. These results are summarized in Table XIV. The classification results of the model can be seen in Fig. 7, where the model successfully identified TP 91, TN 92, FP 16, and FN 17 from the total analyzed tweet data.
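As a sanity check, the reported metrics can be recomputed directly from confusion-matrix counts. The sketch below applies Equations (5)-(8) to the polynomial-kernel counts from Fig. 5 (TP 140, TN 91, FP 19, FN 6) and reproduces the rounded percentages reported above.

```python
def confusion_metrics(tp, tn, fp, fn):
    # Equations (5)-(8): recall, accuracy, precision, F1-score.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts reported for the polynomial kernel in Fig. 5.
acc, prec, rec, f1 = confusion_metrics(tp=140, tn=91, fp=19, fn=6)
print(f"acc={acc:.0%} prec={prec:.0%} rec={rec:.0%} f1={f1:.0%}")
# prints: acc=90% prec=88% rec=96% f1=92%
```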
RBF kernels can significantly improve accuracy in non-linear SVMs when the right hyperparameters are carefully tuned. By adjusting parameters such as gamma and C, SVM-RBF can effectively adapt its model to cope with complex, non-linear relationships between features in classification tasks, producing more accurate results and a better fit to the given data [33]. Setting these parameters wisely through tuning techniques such as cross-validation can help improve the performance of SVM-RBF in handling non-linear classification tasks. The result is a model that is better able to understand and adapt to complex relationships in the data, which in turn improves accuracy in classifying new data in various application contexts.

IV. CONCLUSION
Sentiment analysis of online marketplaces in Indonesia was conducted on Twitter social media with 1276 tweets, comprising 538 positive and 738 negative sentiments, using the non-linear SVM method. The process included data preprocessing, labelling, term weighting using the TF-IDF method, and data splitting under three testing scenarios. GridSearchCV combines cross-validation and non-linear SVM parameters, with model evaluation using a confusion matrix. The best SVM model from the scenario results was obtained with an 80% training and 20% testing split and the best hyperparameters on the RBF kernel. The optimal parameters obtained from the experiments, C = 100 and gamma = 0.01, resulted in a model accuracy of 89%. When applied to data the model has never seen, the accuracy increases to 90%, with an F1-score of 91%, precision of 88%, and recall of 95% on negative sentiment. In conclusion, the performance evaluation of the non-linear SVM models obtained the highest accuracy on the RBF kernel for sentiment towards online marketplaces. The model's performance could potentially be improved further by tuning the hyperparameters of the non-linear SVM kernels.

3) Precision: the ratio of correctly classified True Positive cases to the total predicted positive cases (7).
4) F1-Score: the weighted harmonic mean of precision and recall (8).

Fig. 4 A visual analysis of an SVM graphic representation

Fig. 5 Confusion matrix for the polynomial kernel

Fig. 7 Confusion matrix for the sigmoid kernel