Research Article | Open Access
Volume 74 | Issue 3 | Year 2026 | Article Id. IJETT-V74I3P112 | DOI: https://doi.org/10.14445/22315381/IJETT-V74I3P112

ViT-NARCap: Image Captioning with Vision Transformer Context-Aware Nucleus Sampling and RoBERTa Re-Ranker
Bhargavi Polepalli, Praveen Kumar Sekharamantry, Konda Srinivasa Rao
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 01 Jul 2025 | 13 Oct 2025 | 17 Nov 2025 | 28 Mar 2026 |
Citation:
Bhargavi Polepalli, Praveen Kumar Sekharamantry, Konda Srinivasa Rao, "ViT-NARCap: Image Captioning with Vision Transformer Context-Aware Nucleus Sampling and RoBERTa Re-Ranker," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 3, pp. 153-168, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I3P112
Abstract
Image captioning seeks to generate semantically accurate and linguistically fluent textual descriptions of visual content, a task made difficult by the need for contextual knowledge and linguistic variety. Traditional encoder-decoder systems, typically built on CNN encoders and RNN decoders, struggle with long-range dependency modeling, exposure bias, and repetitive captions. To address these shortcomings, this paper proposes an image captioning network that combines a Vision Transformer encoder with a normalized auto-regressive, fine-tuned Transformer decoder. The Vision Transformer captures hierarchical visual representations and global spatial relationships, while the Transformer-based decoder supports coherent, context-aware sentence generation. To further increase caption diversity and fluency, nucleus sampling is applied during decoding, and a RoBERTa-based re-ranking mechanism is introduced to refine the selected captions according to semantic relevance. Experimental analysis on benchmark datasets shows that the proposed approach outperforms existing methods on standard metrics such as BLEU, CIDEr, ROUGE-L, and METEOR, demonstrating its effectiveness and robustness.
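The pipeline outlined above, a ViT encoder feeding an auto-regressive Transformer decoder, nucleus (top-p) sampling of candidate captions, and a RoBERTa-based re-ranking step, can be illustrated with a minimal sketch built from off-the-shelf Hugging Face components. The checkpoint names (`nlpconnect/vit-gpt2-image-captioning`, `roberta-base`), the `top_p=0.9` value, and the use of RoBERTa's masked-LM pseudo-log-likelihood as the re-ranking score are stand-in assumptions for illustration, not the authors' exact configuration.

```python
# Sketch of the abstract's pipeline: ViT encoder + auto-regressive decoder,
# nucleus (top-p) sampling for diverse candidates, RoBERTa-based re-ranking.
# Checkpoints, top_p value, and the re-ranking score are assumptions.
import torch
from PIL import Image
from transformers import (
    VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer,
    RobertaTokenizer, RobertaForMaskedLM,
)

# Off-the-shelf ViT-encoder / GPT-2-decoder captioner, a stand-in for the
# paper's fine-tuned model.
ckpt = "nlpconnect/vit-gpt2-image-captioning"
captioner = VisionEncoderDecoderModel.from_pretrained(ckpt)
processor = ViTImageProcessor.from_pretrained(ckpt)
cap_tok = AutoTokenizer.from_pretrained(ckpt)

# RoBERTa masked LM used here as a fluency/semantic-plausibility scorer.
rob_tok = RobertaTokenizer.from_pretrained("roberta-base")
rob_lm = RobertaForMaskedLM.from_pretrained("roberta-base")
rob_lm.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Mask each token in turn and sum RoBERTa's log-probability of the
    original token (the masked-LM pseudo-log-likelihood of Salazar et al.),
    length-normalized so short captions are not unfairly favored."""
    ids = rob_tok(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip <s> and </s>
        masked = ids.clone()
        masked[i] = rob_tok.mask_token_id
        with torch.no_grad():
            logits = rob_lm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total / max(len(ids) - 2, 1)

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Nucleus sampling: at each step, sample only from the smallest token set
# whose cumulative probability exceeds top_p, yielding diverse candidates.
out = captioner.generate(
    pixel_values,
    do_sample=True, top_p=0.9, max_length=32,
    num_return_sequences=5,
)
candidates = [cap_tok.decode(seq, skip_special_tokens=True) for seq in out]

# Re-rank the sampled candidates and keep the best-scoring caption.
print(max(candidates, key=pseudo_log_likelihood))
```

Nucleus sampling truncates the low-probability tail of the decoder's token distribution at each step, which injects diversity without the degenerate repetition that greedy or beam search can produce; the re-ranking step then restores quality control by selecting the most fluent, semantically plausible candidate.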
Keywords
Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), Image Captioning, Nucleus Sampling, Vision Transformer.