Document Type: Research Paper

Authors

1 Computer Science Department, Faculty of Science, Soran University, Soran, Erbil, Kurdistan

2 Faculty of New Sciences and Technologies, University of Tehran (visitor at Soran University)

DOI: 10.37652/juaps.2022.176500

Abstract

Automatic Speech Recognition (ASR), an active area of speech processing, is now used in real-world applications implemented with a variety of techniques; among them, artificial neural networks are the most popular. Improving accuracy and making these systems robust to noise are among the current challenges. This paper addresses the development of an ASR system for the Central Kurdish language (CKB) using transfer learning of Deep Neural Networks (DNNs). An Acoustic Model (AM) is built on the AsoSoft CKB speech dataset by combining Mel-Frequency Cepstral Coefficients (MFCCs) for feature extraction with a Long Short-Term Memory (LSTM) network and a Connectionist Temporal Classification (CTC) output layer. In addition, an N-gram language model is trained on a large collected text corpus of about 300 million tokens; the same corpus is used to extract a dynamic lexicon model that contains over 2.5 million CKB words. The obtained results show that the DNN improves on classical statistical models. By combining transfer learning and language-model adaptation, the proposed method achieves a 0.22% word error rate, surpassing the best previously reported result for CKB.
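As a minimal, hypothetical sketch of the acoustic-model pipeline summarized above (MFCC features feeding a bidirectional LSTM trained with a CTC loss), the following Python fragment uses PyTorch and torchaudio; the library choice and every hyperparameter (number of coefficients, hidden size, token inventory) are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the abstract's pipeline: MFCC features -> BiLSTM -> CTC.
# All hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn
import torchaudio

N_MFCC = 13    # number of cepstral coefficients (assumed)
HIDDEN = 256   # LSTM hidden size (assumed)
N_TOKENS = 40  # CKB grapheme inventory + CTC blank (assumed)

# MFCC front end: converts a raw waveform into cepstral features.
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=N_MFCC)

class LstmCtcAM(nn.Module):
    """Bidirectional LSTM acoustic model with a CTC output layer."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(N_MFCC, HIDDEN, num_layers=3,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * HIDDEN, N_TOKENS)  # 2x for bidirectional

    def forward(self, feats):
        out, _ = self.lstm(feats)               # (batch, time, 2*HIDDEN)
        return self.proj(out).log_softmax(-1)   # CTC expects log-probabilities

model = LstmCtcAM()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# One illustrative training step on dummy data standing in for an utterance.
wave = torch.randn(1, 16000)                   # 1 s of 16 kHz audio (dummy)
feats = mfcc(wave).transpose(1, 2)             # (batch, time, N_MFCC)
log_probs = model(feats).transpose(0, 1)       # CTCLoss wants (time, batch, tokens)
targets = torch.randint(1, N_TOKENS, (1, 10))  # dummy grapheme indices
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.tensor([log_probs.size(0)]),
                target_lengths=torch.tensor([targets.size(1)]))
loss.backward()
```

In the full system described in the abstract, the log-probabilities from such a model would be decoded together with the N-gram language model and the dynamic lexicon, rather than trained on dummy targets as in this sketch.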

