Research Article | Open Access
Volume 74 | Issue 3 | Year 2026 | Article Id. IJETT-V74I3P117 | DOI: https://doi.org/10.14445/22315381/IJETT-V74I3P117

FNB-T5 Linguistically Informed Neural Translation of Code-Mixed Text
Surinder Pal Singh, Neeraj Mangla
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 26 Nov 2025 | 26 Jan 2026 | 06 Feb 2026 | 28 Mar 2026 |
Citation:
Surinder Pal Singh, Neeraj Mangla, "FNB-T5 Linguistically Informed Neural Translation of Code-Mixed Text," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 3, pp. 228-247, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I3P117
Abstract
Code-mixed text, and Hinglish in particular, poses considerable machine translation challenges: the language is informal, its spelling is not standardised, and the language often shifts within a single sentence. Conventional neural machine translation models tend to perform poorly in this setting because they lack linguistic grounding and struggle with ambiguous cases and scarce data. This paper proposes a new hybrid translation system that combines symbolic and neural translators to translate code-mixed Hinglish text into standard English. The proposed system uses a Finite-State Machine (FSM) for structural pattern recognition, character n-gram similarity for spelling variation, phonetic alignment, and transformer-based Part-of-Speech (POS) tagging for syntactic interpretation. These features are combined into enriched token prompts and passed to a fine-tuned Multilingual Pre-trained Text-to-Text Transformer (mT5) model for translation. For a comprehensive assessment, the model was trained and evaluated on a carefully curated corpus of 200,000 Hinglish sentences, and its performance was measured with standard metrics, including BLEU, TER, chrF, and COMET. The results indicate that the hybrid model outperforms state-of-the-art neural, statistical, and rule-based baselines on all metrics: BLEU 39.2, COMET 0.74, TER 41.3, and chrF 66.4. The model remains robust in the low-resource and noisy conditions that commonly degrade deep models. This work shows how hybrid architectures can benefit multilingual and informal Natural Language Processing (NLP) systems, and it offers a scalable approach to the growing problem of code-mixed language translation.
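Two of the pipeline stages the abstract names, character n-gram similarity for spelling variation and the assembly of enriched token prompts for mT5, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`char_ngrams`, `dice_similarity`, `canonicalise`, `build_prompt`), the Dice coefficient, the `token|lang|POS` prompt layout, and the similarity threshold are all illustrative choices, not the authors' exact implementation.

```python
# Sketch of two stages described in the abstract (illustrative, not the
# paper's implementation): (1) character n-gram similarity to map noisy
# Hinglish spelling variants onto canonical forms, and (2) fusing linguistic
# features into an "enriched token prompt" for a text-to-text model.

def char_ngrams(word: str, n: int = 3) -> set:
    """Character n-grams of a word, padded so short words still yield grams."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + n] for i in range(max(1, len(padded) - n + 1))}

def dice_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over character n-gram sets (1.0 = identical)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def canonicalise(token: str, lexicon: list, threshold: float = 0.5) -> str:
    """Replace a token with its closest lexicon entry if similar enough."""
    best = max(lexicon, key=lambda w: dice_similarity(token, w))
    return best if dice_similarity(token, best) >= threshold else token

def build_prompt(tokens, tags) -> str:
    """Fuse each token with its language and POS tag into one source string."""
    enriched = [f"{tok}|{lang}|{pos}" for tok, (lang, pos) in zip(tokens, tags)]
    return "translate Hinglish to English: " + " ".join(enriched)

lexicon = ["nahi", "kyun", "accha"]
print(canonicalise("nahin", lexicon))  # spelling variant maps to "nahi"
print(build_prompt(["mujhe", "call", "karna"],
                   [("hi", "PRON"), ("en", "NOUN"), ("hi", "VERB")]))
```

In a full system, a string like the one `build_prompt` produces would be tokenised and fed to the fine-tuned mT5 model, letting the decoder condition on the symbolic features as well as the raw tokens.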
Keywords
Linguistic Model, Low-Resource Machine Translation, Linguistic Feature Fusion, Neural–Symbolic Learning, Rule-Based and Neural Integration.