Research Article | Open Access
Volume 74 | Issue 3 | Year 2026 | Article Id. IJETT-V74I3P117 | DOI: https://doi.org/10.14445/22315381/IJETT-V74I3P117

FNB-T5 Linguistically Informed Neural Translation of Code-Mixed Text
Surinder Pal Singh, Neeraj Mangla
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 26 Nov 2025 | 26 Jan 2026 | 06 Feb 2026 | 28 Mar 2026 |
Citation:
Surinder Pal Singh, Neeraj Mangla, "FNB-T5 Linguistically Informed Neural Translation of Code-Mixed Text," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 3, pp. 228-247, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I3P117
Abstract
Code-mixed text, and Hinglish in particular, poses considerable machine translation challenges: the language is informal, its spelling is not standardised, and the language often shifts within a single sentence. Conventional neural machine translation models tend to perform poorly in this setting because they lack linguistic grounding and struggle with ambiguous cases and scarce data. This paper proposes a new hybrid translation system that combines symbolic and neural translators to translate code-mixed Hinglish text into standard English. The proposed system uses a Finite-State Machine (FSM) for structural pattern recognition, character n-gram similarity for spelling variation, phonetic alignment, and transformer-based Part-of-Speech (POS) tagging for syntactic interpretation. These features are combined into enriched token prompts and passed to a fine-tuned Multilingual Pre-trained Text-to-Text Transformer (mT5) model for translation. For a comprehensive assessment, the model was trained and evaluated on a carefully curated corpus of 200,000 Hinglish sentences, and its performance was measured with standard metrics, including BLEU, TER, chrF, and COMET. The results indicate that the hybrid model outperforms state-of-the-art neural, statistical, and rule-based baselines on all metrics: BLEU 39.2, COMET 0.74, TER 41.3, and chrF 66.4. The model remains robust in the low-resource and noisy conditions that commonly degrade deep models. This work shows how hybrid architectures can benefit multilingual and informal Natural Language Processing (NLP) systems, and it offers a scalable approach to the growing problem of code-mixed language translation.
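Two of the pipeline stages the abstract names, character n-gram similarity for spelling variation and the assembly of enriched token prompts for mT5, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`char_ngrams`, `dice_similarity`, `canonicalise`, `build_prompt`), the Dice coefficient, the `token|lang|POS` prompt layout, and the similarity threshold are all illustrative choices, not the authors' exact implementation.

```python
# Sketch of two stages described in the abstract (illustrative, not the
# paper's implementation): (1) character n-gram similarity to map noisy
# Hinglish spelling variants onto canonical forms, and (2) fusing linguistic
# features into an "enriched token prompt" for a text-to-text model.

def char_ngrams(word: str, n: int = 3) -> set:
    """Character n-grams of a word, padded so short words still yield grams."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + n] for i in range(max(1, len(padded) - n + 1))}

def dice_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over character n-gram sets (1.0 = identical)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def canonicalise(token: str, lexicon: list, threshold: float = 0.5) -> str:
    """Replace a token with its closest lexicon entry if similar enough."""
    best = max(lexicon, key=lambda w: dice_similarity(token, w))
    return best if dice_similarity(token, best) >= threshold else token

def build_prompt(tokens, tags) -> str:
    """Fuse each token with its language and POS tag into one source string."""
    enriched = [f"{tok}|{lang}|{pos}" for tok, (lang, pos) in zip(tokens, tags)]
    return "translate Hinglish to English: " + " ".join(enriched)

lexicon = ["nahi", "kyun", "accha"]
print(canonicalise("nahin", lexicon))  # spelling variant maps to "nahi"
print(build_prompt(["mujhe", "call", "karna"],
                   [("hi", "PRON"), ("en", "NOUN"), ("hi", "VERB")]))
```

In a full system, a string like the one `build_prompt` produces would be tokenised and fed to the fine-tuned mT5 model, letting the decoder condition on the symbolic features as well as the raw tokens.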
Keywords
Linguistic Model, Low-Resource Machine Translation, Linguistic Feature Fusion, Neural–Symbolic Learning, Rule-Based and Neural Integration.