Research Article | Open Access
Volume 74 | Issue 3 | Year 2026 | Article Id. IJETT-V74I3P126 | DOI: https://doi.org/10.14445/22315381/IJETT-V74I3P126
Tourism MATE-LLM: Multimodal Cross-Attention Fusion of Skip-Gram and ConvNext Embeddings for Tourism Destination Recommendation
V Indumathy, K Shantha Kumari
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 06 Dec 2025 | 27 Jan 2026 | 06 Feb 2026 | 28 Mar 2026 |
Citation :
V Indumathy, K Shantha Kumari, "Tourism MATE-LLM: Multimodal Cross-Attention Fusion of Skip-Gram and ConvNext Embeddings for Tourism Destination Recommendation," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 3, pp. 376-387, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I3P126
Abstract
Multimodal data in the tourism domain requires a precise representation that retains useful content. Tourists draw on a large number of sources when planning a trip, and the information they ask for about a destination ranges from basic needs to luxury amenities. Meeting this demand requires an integrated representation of the tourist area that covers the various factors tourists care about. With this objective, the proposed work uses a cross-attention model to fuse multimodal data, combining image features extracted with ConvNeXt and text features extracted with the Skip-Gram model. The cross-attention model captures the important correlations between the different data inputs based on the weights and feature values. The proposed work attained 58% data fusion using two-modality tourism datasets and produced an accuracy of 85.28% using the India Tourism dataset from Kaggle and the India Tourist destination dataset from Mendeley.
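As an illustration of the fusion step described in the abstract, the following is a minimal sketch of single-head scaled dot-product cross-attention between text and image features. It is not the authors' implementation: the function and variable names, the dimensions, the random projection matrices, and the choice of text features as queries with image features as keys and values are all assumptions made for this example. Random vectors stand in for real Skip-Gram word embeddings and ConvNeXt patch features.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cross_attention(text_tokens, image_patches, w_q, w_k, w_v):
    """Assumed setup: text supplies the queries, image supplies keys and values."""
    q = matmul(text_tokens, w_q)      # (n_text x d)
    k = matmul(image_patches, w_k)    # (n_img x d)
    v = matmul(image_patches, w_v)    # (n_img x d)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # (n_text x n_img)
    weights = [softmax([s / scale for s in row]) for row in scores]
    fused = matmul(weights, v)        # (n_text x d)
    return fused, weights


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
text_dim, image_dim, attn_dim = 4, 6, 3           # hypothetical sizes
text_tokens = random_matrix(2, text_dim, rng)     # stand-ins for 2 word vectors
image_patches = random_matrix(5, image_dim, rng)  # stand-ins for 5 patch features
w_q = random_matrix(text_dim, attn_dim, rng)      # untrained projections
w_k = random_matrix(image_dim, attn_dim, rng)
w_v = random_matrix(image_dim, attn_dim, rng)

fused, weights = cross_attention(text_tokens, image_patches, w_q, w_k, w_v)
```

Each row of `weights` is a probability distribution over the image patches for one text token, and each row of `fused` is the corresponding weighted mix of projected image features. In a trained model the projection matrices would be learned rather than sampled at random.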
Keywords
ConvNeXt, Cross attention model, Data fusion, Multimodal, Tourism.