Tourism MATE-LLM: Multimodal Cross-Attention Fusion of Skip-Gram and ConvNext Embeddings for Tourism Destination Recommendation

V Indumathy; K Shantha Kumari

doi:https://doi.org/10.14445/22315381/IJETT-V74I3P126

Research Article | Open Access

Volume 74 | Issue 3 | Year 2026 | Article Id. IJETT-V74I3P126

Received	Revised	Accepted	Published
06 Dec 2025	27 Jan 2026	06 Feb 2026	28 Mar 2026

Citation:

V Indumathy, K Shantha Kumari, "Tourism MATE-LLM: Multimodal Cross-Attention Fusion of Skip-Gram and ConvNext Embeddings for Tourism Destination Recommendation," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 3, pp. 376-387, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I3P126

Abstract

Multimodal data in the tourism domain requires a precise representation that retains useful content. Tourism draws on data from a large number of sources to satisfy tourists' demands: from basic needs to luxury services, tourists need information about a destination in order to plan. This calls for an integrated representation of the destination that covers the various factors tourists ask about. With this objective, the proposed work uses a cross-attention model to fuse multimodal data, combining image features extracted with ConvNeXt and text features extracted with the Skip-Gram model. The cross-attention model captures the important correlations between the different inputs based on the attention weights and feature values. The proposed work attained 58% data fusion using two tourism datasets of different modality types and produced an accuracy of 85.28% using the India Tourism dataset from Kaggle and the India Tourist destination dataset from Mendeley.
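The fusion step described in the abstract can be illustrated with a minimal sketch of scaled dot-product cross-attention, in which text embeddings act as queries over image features. This is not the authors' implementation: the dimensions (100-d text vectors, 768-d image vectors, a 64-d shared space), the random stand-in embeddings, and the names `cross_attention`, `w_q`, `w_k`, `w_v` are all assumptions chosen for illustration, and the projection matrices are random rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(text_emb, image_emb, w_q, w_k, w_v):
    """Fuse two modalities: text tokens query the image features.

    text_emb:  (n_tokens, d_text)   e.g. Skip-Gram word vectors
    image_emb: (n_patches, d_image) e.g. ConvNeXt spatial features
    Returns the fused representation and the attention weights.
    """
    q = text_emb @ w_q                   # (n_tokens, d_model)
    k = image_emb @ w_k                  # (n_patches, d_model)
    v = image_emb @ w_v                  # (n_patches, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    fused = weights @ v                  # (n_tokens, d_model)
    return fused, weights


# Hypothetical stand-ins for real embeddings (random, for shape checking only).
rng = np.random.default_rng(0)
n_tokens, d_text = 6, 100       # assumed Skip-Gram vector size
n_patches, d_image = 49, 768    # assumed 7x7 feature map with 768 channels
d_model = 64                    # assumed shared attention dimension

text_emb = rng.normal(size=(n_tokens, d_text))
image_emb = rng.normal(size=(n_patches, d_image))

# Random projections; in a trained model these would be learned parameters.
w_q = rng.normal(size=(d_text, d_model)) / np.sqrt(d_text)
w_k = rng.normal(size=(d_image, d_model)) / np.sqrt(d_image)
w_v = rng.normal(size=(d_image, d_model)) / np.sqrt(d_image)

fused, weights = cross_attention(text_emb, image_emb, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` indicates how strongly one text token attends to each image region, which is one way to read the abstract's "correlation factors between the different data inputs". Which modality supplies the queries is a design choice, and the paper may use the opposite direction or both.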

Keywords

ConvNeXt, Cross-attention model, Data fusion, Multimodal, Tourism.
