A Comparative Analysis of Transformer-Based Topic Modeling Pipelines for Scientific Literature

Farriel Arrianta Akbar Pratama; Muhammad Eka Nur Arief; Vinna Rahmayanti Setyaning Nastiti

doi:10.30595/juita.v14i1.28346

Authors

Farriel Arrianta Akbar Pratama, Universitas Muhammadiyah Malang
Muhammad Eka Nur Arief, Universitas Muhammadiyah Malang
Vinna Rahmayanti Setyaning Nastiti, Universitas Muhammadiyah Malang

Keywords:

BERTopic; Natural Language Processing; topic modeling; transformer models.

Abstract

The exponential growth of scientific literature makes it impractical to identify thematic trends by hand, which motivates automated analysis. This study compares three transformer-based topic modeling pipelines to determine which yields the most coherent topics from scientific text. Each pipeline was implemented and evaluated on a corpus of 20,972 scientific article abstracts: a custom pipeline combining SBERT, UMAP, and HDBSCAN; a second custom pipeline combining RoBERTa, PCA, and KMeans; and the integrated BERTopic model. Topic quality was measured with the C_v coherence score. BERTopic achieved the highest score, 0.7012, ahead of the SBERT-based pipeline (0.6079) and the RoBERTa-based pipeline (0.4756). These results indicate that, on this corpus, an integrated, purpose-built model such as BERTopic produces more coherent and interpretable thematic structures than the two modular pipelines tested. The study offers empirical guidance for researchers choosing between integrated models and modular pipeline designs for large-scale literature analysis.
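To make the evaluation criterion more concrete, the following is a minimal sketch of a topic coherence score. It is not the C_v measure reported in the abstract: C_v additionally uses a sliding window over the text and cosine similarity between context vectors, whereas this sketch computes only the mean normalized pointwise mutual information (NPMI) of topic word pairs from document-level co-occurrence. The function name `npmi_coherence`, the toy corpus, and the two example topics are hypothetical and chosen for illustration only.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Mean NPMI over all word pairs of one topic (document-level co-occurrence).

    Simplified stand-in for illustration; this is NOT the C_v score.
    """
    if len(topic_words) < 2:
        raise ValueError("need at least two topic words")

    # Represent each document as a set of lowercase whitespace tokens.
    doc_sets = [set(doc.lower().split()) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # the pair never co-occurs: minimum NPMI
        elif p12 == 1.0:
            scores.append(0.0)   # degenerate case (0/0); treated as neutral here
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical toy corpus: three machine-learning and three astronomy "abstracts".
documents = [
    "neural network training with gradient descent",
    "deep neural network for image classification",
    "gradient descent optimizes the network loss",
    "galaxy formation and dark matter halos",
    "dark matter density in a spiral galaxy",
    "stellar mass of a distant galaxy",
]

coherent_score = npmi_coherence(["neural", "network", "gradient"], documents)
mixed_score = npmi_coherence(["neural", "galaxy", "gradient"], documents)
print(f"coherent topic: {coherent_score:.3f}, mixed topic: {mixed_score:.3f}")
```

A topic whose words tend to appear in the same documents scores higher than one that mixes unrelated vocabulary, which is the intuition behind ranking pipelines by coherence. For an actual C_v computation, one commonly used option is gensim's `CoherenceModel` with `coherence="c_v"`; whether the study used that implementation is not stated here.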

Author Biographies

Farriel Arrianta Akbar Pratama, Universitas Muhammadiyah Malang

Informatics Engineering

Muhammad Eka Nur Arief, Universitas Muhammadiyah Malang

Informatics Engineering

Vinna Rahmayanti Setyaning Nastiti, Universitas Muhammadiyah Malang

Informatics Engineering
