Analysis of K-NN with the Integration of Bag of Words, TF-IDF, and N-Grams for Hate Speech Classification on Twitter

Kuncoro Hadi; Ema Utami

doi:10.30595/juita.v12i2.23829

Authors

Kuncoro Hadi, Universitas Amikom Yogyakarta
Ema Utami, Universitas Amikom Yogyakarta

Keywords:

hate speech, K-Nearest Neighbors, Bag of Words, TF-IDF, N-Grams, F1 Score

Abstract

Social media has become one of the primary communication channels of the modern world, but it is also a platform on which hate speech spreads easily. This study evaluates a hate speech classification model based on the K-Nearest Neighbors (K-NN) algorithm combined with three feature extraction techniques: Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and N-Grams. The dataset consists of 13,169 entries representing a diverse range of hate speech examples commonly encountered on social media platforms. We assess the model's performance with each feature extraction technique. The K-NN model performs best when the k parameter is set to 3 (k=3). Under this configuration, it achieves an accuracy of 86.88%, a precision of 88.27%, a recall of 86.88%, and an F1-Score of 86.50%. These results indicate that combining TF-IDF feature extraction with the K-NN algorithm yields the best hate speech classification performance among the techniques compared.
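The pipeline described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. The abstract does not state the preprocessing steps, the distance measure, or how the metrics were averaged, so the following are assumptions made only for illustration: whitespace tokenization, a smoothed IDF formula, cosine similarity as the neighbor criterion, support-weighted averaging of precision, recall, and F1, and a handful of invented English placeholder texts standing in for the 13,169-entry dataset. All function names are hypothetical.

```python
import math
from collections import Counter

def ngram_tokens(text, n_max=1):
    """Lowercase, split on whitespace, and emit all word n-grams of length 1..n_max."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]

def bow(text, n_max=1):
    """Bag of Words: raw term counts."""
    return Counter(ngram_tokens(text, n_max))

def fit_idf(train_texts, n_max=1):
    """Smoothed IDF learned from the training texts only (assumed variant)."""
    n_docs = len(train_texts)
    df = Counter()
    for text in train_texts:
        df.update(set(ngram_tokens(text, n_max)))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(text, idf, n_max=1):
    """TF-IDF: term counts scaled by IDF; terms unseen in training are dropped."""
    return {t: c * idf[t] for t, c in bow(text, n_max).items() if t in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote among the k most cosine-similar training vectors."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    precision = recall = f1 = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = support / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return accuracy, precision, recall, f1

# Invented placeholder data; NOT the dataset used in the study.
train_texts = [
    "you people are awful and stupid",
    "i hate those stupid people",
    "awful stupid people go away",
    "have a lovely day everyone",
    "what a lovely sunny day",
    "thank you everyone for the help",
]
train_labels = ["hate", "hate", "hate", "neutral", "neutral", "neutral"]
test_texts = ["those people are stupid", "lovely day thank you"]
test_labels = ["hate", "neutral"]

idf_uni = fit_idf(train_texts, n_max=1)
idf_bi = fit_idf(train_texts, n_max=2)
extractors = {
    "BoW": lambda text: bow(text),
    "TF-IDF": lambda text: tfidf(text, idf_uni),
    "TF-IDF + bigrams": lambda text: tfidf(text, idf_bi, n_max=2),
}

results = {}
for name, vectorize in extractors.items():
    train_vectors = [vectorize(t) for t in train_texts]
    predictions = [knn_predict(vectorize(t), train_vectors, train_labels, k=3)
                   for t in test_texts]
    results[name] = (predictions, weighted_scores(test_labels, predictions))
    print(name, predictions, results[name][1])
```

Under support-weighted averaging, recall always equals accuracy, which is consistent with the identical 86.88% figures reported above, although the abstract does not say which averaging scheme was used. The toy data is far too small to say anything about which feature extraction technique performs better; the sketch only shows how the three representations feed into the same k=3 classifier.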

Author Biographies

Kuncoro Hadi, Universitas Amikom Yogyakarta

Master of Informatics (Magister Informatika)

Ema Utami, Universitas Amikom Yogyakarta

Master of Informatics (Magister Informatika)
