Transfer Learning-Based Detection of Dysarthric Speech Using Lightweight Convolutional Neural Networks

Henry Ardian Irianta; Abdul Fadlil; Rusydi Umar

doi:10.30595/juita.v13i3.27695

Authors

Henry Ardian Irianta, University of Siber Muhammadiyah
Abdul Fadlil, University of Ahmad Dahlan
Rusydi Umar, University of Ahmad Dahlan

Keywords:

automatic speech recognition, dysarthria, lightweight model, transfer learning, MFCC

Abstract

Automatic speech recognition (ASR) for atypical speech, such as dysarthric speech, is challenging because high acoustic variability often causes standard models to fail. The challenge is compounded when the target is an edge device with limited computational resources, memory, and power: on-device ASR with low latency requires model architectures that are both accurate and lightweight. This research explores modern deep learning architectures that address both requirements, namely accuracy on dysarthric speech and computational efficiency. The study implements and evaluates three efficient models (MobileNetV3Small, EfficientNetB0, and NASNetMobile) on the UASpeech and TORGO datasets. The methodology extracts Mel-Frequency Cepstral Coefficient (MFCC) features, which are visualized as spectrograms and then classified using a transfer learning approach. Experimental results show that MobileNetV3Small achieved the highest performance on the UASpeech dataset, with an accuracy of 97.8%. The study concludes that lightweight CNN architectures such as MobileNetV3Small are highly effective for dysarthric speech classification and demonstrate the feasibility of developing robust, practical ASR systems for resource-constrained environments.
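The MFCC extraction step named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the abstract does not state the frame length, hop size, number of mel filters, or number of coefficients, so the values below, the function names, and the synthetic test tone are all assumptions chosen for illustration. The sketch covers only feature extraction (windowing, power spectrum, mel filterbank, log, DCT-II); the transfer learning classifier is not shown.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum of one windowed frame (bins 0..N/2)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    max_mel = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(max_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * bin_hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def dct2(values, n_out):
    """Unnormalized DCT-II, keeping the first n_out coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n) for i, v in enumerate(values))
            for c in range(n_out)]

def mfcc(signal, sample_rate, frame_len=256, hop=128, n_filters=20, n_coeffs=13):
    """Return one list of n_coeffs MFCCs per frame (all defaults are assumed values)."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]            # Hamming window
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spec = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
        log_e = [math.log(e + 1e-10) for e in energies]  # small offset avoids log(0)
        features.append(dct2(log_e, n_coeffs))
    return features

# Hypothetical input: 0.1 s of a 440 Hz tone sampled at 8 kHz (not real speech).
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(800)]
feats = mfcc(tone, sr)
print(len(feats), len(feats[0]))  # 5 13
```

The resulting frames-by-coefficients matrix is the kind of array that can be rendered as an image and passed to a pretrained CNN. In practice a library routine such as `librosa.feature.mfcc` would normally replace the hand-written loops above.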
