Comparative Analysis of BERT Model and DistilBERT Model for Enhanced Clickbait Headline Structure Detection in Indonesian Online News

Rananggana Trustha Dewangga; Budi Prasetiyo

doi:10.30595/juita.v13i3.26479

Authors

Rananggana Trustha Dewangga, Universitas Negeri Semarang
Budi Prasetiyo, Universitas Negeri Semarang

Keywords:

clickbait, text classification, natural language processing, deep learning, transformer-based models

Abstract

Clickbait uses sensational or misleading headlines to attract readers, which can degrade information quality in online news. This study presents a comparative evaluation of BERT and DistilBERT for detecting clickbait headline structures in Indonesian using the CLICK-ID dataset. The approach examines how class imbalance influences performance by training both models on multiple dataset variants created through oversampling, undersampling, and data augmentation. Inputs are tokenized with model-specific tokenizers, and the models are evaluated with accuracy, precision, recall, and F1-score. Confusion matrices are used to interpret error patterns across classes. Experimental results show that DistilBERT trained on the oversampled dataset achieves 94% on accuracy, precision, recall, and F1-score, while BERT in the same oversampled setting reaches 93%. Models trained on the unbalanced data yield the lowest recall and F1-score for the clickbait class, confirming the adverse effect of skewed class distributions. The augmented and undersampled variants produce slightly lower but competitive results, in the 92% to 93% range. Error analysis shows that DistilBERT reduces the number of missed clickbait headlines while maintaining a similar level of false positives, producing more balanced behavior across classes. These results exceed those reported in prior CLICK-ID studies and highlight the advantage of transformer architectures combined with effective class balancing for Indonesian clickbait detection.
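The balancing and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which oversampling or undersampling procedure was used, so the random duplication and random dropping shown here are assumptions, and the function names and toy data are hypothetical. In practice, balancing is usually applied to the training split only, so that evaluation data keeps its original distribution.

```python
import random
from collections import Counter


def _group_by_label(rows):
    """Group (text, label) pairs by their label."""
    groups = {}
    for text, label in rows:
        groups.setdefault(label, []).append((text, label))
    return groups


def random_oversample(rows, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one.

    Assumption: plain random duplication; the paper's exact method is not stated.
    """
    rng = random.Random(seed)
    groups = _group_by_label(rows)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


def random_undersample(rows, seed=0):
    """Randomly drop rows until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_label(rows)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary task with the given positive label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical toy data: label 1 = clickbait, label 0 = non-clickbait.
    rows = [(f"headline {i}", 0) for i in range(6)]
    rows += [(f"teaser {i}", 1) for i in range(2)]
    print(Counter(label for _, label in random_oversample(rows)))
    print(Counter(label for _, label in random_undersample(rows)))
    tp, fp, fn, tn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
    print(precision_recall_f1(tp, fp, fn))
```

Recall for the clickbait class depends on the false-negative count (missed clickbait), which is why the abstract singles it out as the metric most affected by training on skewed data.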
