International Journal of Engineering Trends and Technology

Research Article | Open Access
Volume 74 | Issue 3 | Year 2026 | Article Id. IJETT-V74I3P112 | DOI: https://doi.org/10.14445/22315381/IJETT-V74I3P112

ViT-NARCap: Image Captioning with Vision Transformer, Context-Aware Nucleus Sampling, and RoBERTa Re-Ranker


Bhargavi Polepalli, Praveen Kumar Sekharamantry, Konda Srinivasa Rao

Received: 01 Jul 2025 | Revised: 13 Oct 2025 | Accepted: 17 Nov 2025 | Published: 28 Mar 2026

Citation:

Bhargavi Polepalli, Praveen Kumar Sekharamantry, Konda Srinivasa Rao, "ViT-NARCap: Image Captioning with Vision Transformer, Context-Aware Nucleus Sampling, and RoBERTa Re-Ranker," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 3, pp. 153-168, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I3P112

Abstract

Image captioning aims to generate semantically accurate and linguistically fluent descriptions of visual content, yet the task remains difficult because it is constrained by limited contextual knowledge and language variety. Traditional encoder-decoder systems, typically built on CNN encoders and RNN decoders, struggle with long-range dependency modeling, exposure bias, and repetitive captions. To address these shortcomings, this paper proposes an improved image captioning network that combines a Vision Transformer encoder with a normalized auto-regressive fine-tuned Transformer decoder. The Vision Transformer captures hierarchical visual representations and global spatial relationships, while the Transformer-based decoder sustains coherent, context-aware sentence generation. To further increase the diversity and fluency of the captions, nucleus sampling is applied during decoding, and a RoBERTa-based re-ranking mechanism refines the sampled candidates according to semantic relevance. Experimental analysis on benchmark datasets shows that the proposed approach outperforms current methods on standard metrics such as BLEU, CIDEr, ROUGE-L, and METEOR, confirming its effectiveness and robustness.
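To make the two decoding-time components concrete, the sketch below shows (1) nucleus (top-p) sampling over a decoder's next-token distribution and (2) re-ranking candidate captions with RoBERTa. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not specify the re-ranking objective or checkpoints, so the `roberta-base` model and the pseudo-log-likelihood fluency score used here are assumptions standing in for the paper's semantic-relevance criterion.

```python
# Minimal sketch of nucleus sampling + RoBERTa-based caption re-ranking.
# Assumptions (not from the paper): roberta-base checkpoint, length-normalized
# masked-LM pseudo-log-likelihood as the re-ranking score.
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample one token id from the smallest set of tokens whose
    cumulative probability mass reaches p (top-p / nucleus sampling)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the minimal prefix whose mass is >= p; always keep the top token.
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p)).item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_ids[choice].item())

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
mlm = RobertaForMaskedLM.from_pretrained("roberta-base").eval()

@torch.no_grad()
def pseudo_log_likelihood(caption: str) -> float:
    """Score a caption by masking each token in turn and summing the
    log-probability RoBERTa assigns to the original token."""
    ids = tokenizer(caption, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip the <s> and </s> specials
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total / max(len(ids) - 2, 1)  # length-normalized

def rerank(candidates: list[str]) -> str:
    """Return the candidate caption with the highest RoBERTa score."""
    return max(candidates, key=pseudo_log_likelihood)
```

In use, `nucleus_sample` would be called at each step of the caption decoder's generation loop (not shown) to draw several diverse candidates, which are then passed to `rerank`; lower values of p make decoding more conservative, higher values more diverse.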

Keywords

Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), Image Captioning, Nucleus Sampling, Vision Transformer.
