An Optimized Hybrid Deep Learning Model for Text-to-Speech
© 2025 by IJETT Journal
Volume-73 Issue-4
Year of Publication: 2025
Author: Hani Q.R. Al-Zoubi
DOI: 10.14445/22315381/IJETT-V73I4P130
How to Cite?
Hani Q.R. Al-Zoubi, "An Optimized Hybrid Deep Learning Model for Text-to-Speech," International Journal of Engineering Trends and Technology, vol. 73, no. 4, pp. 376-385, 2025. Crossref, https://doi.org/10.14445/22315381/IJETT-V73I4P130
Abstract
This work presents a hybrid deep learning model optimized for high-quality Text-to-Speech (TTS) conversion. The model employs Convolutional Neural Networks (CNNs) to extract features from the input text and Recurrent Neural Networks (RNNs) to capture sequential dependencies and improve context awareness. The hybrid design aims to improve both synthesis quality and computational performance: the optimization stage tunes model parameters and refines the training data, yielding consistent performance across linguistic conditions. The proposed model also applies transfer learning, using pre-trained embeddings to accelerate convergence. The research examines how different hyperparameter configurations affect the model's efficiency, offering insight into the key factors that drive the optimization process. Evaluation on benchmark datasets shows that the proposed model achieves greater simplicity, efficiency, and higher average TTS quality than conventional techniques. These results indicate that the developed hybrid model is well suited to real-time TTS applications and can meaningfully advance AI-driven voice synthesis.
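To make the described architecture concrete, the following is a minimal sketch of a hybrid CNN–RNN text encoder with optional pre-trained embedding initialization, written in PyTorch. All layer sizes, the GRU/Conv1d choices, and every name here are illustrative assumptions; the paper does not publish this exact implementation.

```python
import torch
import torch.nn as nn

class HybridTTSEncoder(nn.Module):
    """Illustrative CNN+RNN text encoder in the spirit of the paper's
    hybrid design; the configuration is an assumption, not the
    author's exact model."""

    def __init__(self, vocab_size=256, embed_dim=128, conv_channels=256,
                 rnn_hidden=256, pretrained_embeddings=None):
        super().__init__()
        # Transfer learning: optionally initialize from pre-trained embeddings.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        if pretrained_embeddings is not None:
            self.embedding.weight.data.copy_(pretrained_embeddings)
        # CNN stack extracts local character/phoneme-level features.
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, conv_channels, kernel_size=5, padding=2),
            nn.BatchNorm1d(conv_channels),
            nn.ReLU(),
            nn.Conv1d(conv_channels, conv_channels, kernel_size=5, padding=2),
            nn.BatchNorm1d(conv_channels),
            nn.ReLU(),
        )
        # Bidirectional RNN captures sequential dependencies and context.
        self.rnn = nn.GRU(conv_channels, rnn_hidden, batch_first=True,
                          bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer character/phoneme indices
        x = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        x = self.conv(x.transpose(1, 2))  # Conv1d expects (batch, channels, seq)
        x = x.transpose(1, 2)             # back to (batch, seq_len, channels)
        outputs, _ = self.rnn(x)          # (batch, seq_len, 2 * rnn_hidden)
        return outputs                    # consumed by a decoder/vocoder downstream

# Usage: encode a dummy batch of token IDs.
encoder = HybridTTSEncoder()
tokens = torch.randint(0, 256, (2, 50))
features = encoder(tokens)
print(features.shape)  # torch.Size([2, 50, 512])
```

In such a design, the convolutions supply fast, parallel local feature extraction while the bidirectional GRU adds the long-range context the abstract attributes to the RNN component; the embedding layer is the natural place to plug in pre-trained weights for faster convergence.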
Keywords
Text-to-Speech (TTS), Deep learning hybrid model, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Transfer learning, Hyperparameter tuning, Real-time systems, Artificial intelligence.