Evaluating the Reliability of Large Language Models in Literary Analysis of Arabic Novels: A Structured Benchmark Using Grounded Textual Evidence

Emad A. Aldomour; Ameen Shaheen

doi:https://doi.org/10.14445/22315381/IJETT-V74I5P122

Research Article | Open Access

Volume 74 | Issue 5 | Year 2026 | Article ID: IJETT-V74I5P122

Received: 17 Jan 2026 | Revised: 19 Mar 2026 | Accepted: 28 Mar 2026 | Published: 30 May 2026

Citation:

Emad A. Aldomour, Ameen Shaheen, "Evaluating the Reliability of Large Language Models in Literary Analysis of Arabic Novels: A Structured Benchmark Using Grounded Textual Evidence," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 5, pp. 329-341, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I5P122

Abstract

This study examines the reliability of a contemporary Large Language Model (LLM) in performing strictly text-bound literary analysis of the Arabic novel "الرصاصة الصديقة". Despite the growing use of LLMs in the humanities, their interpretive behavior in Arabic narrative contexts remains underexplored. Using the novel as a controlled corpus, the model was barred from drawing on external knowledge and evaluated under a Grounded-Evidence Protocol that requires every claim to be supported by explicit textual quotations. Output quality was assessed with a five-dimensional rubric measuring Textual Fidelity, Accuracy, Analytical Depth, Coherence, and Linguistic Quality. Quantitative results show strong performance in Coherence (4.50), Linguistic Quality (4.58), and Textual Fidelity (4.30), while Analytical Depth was moderate (3.95), indicating limitations in symbolic reasoning and culturally embedded interpretation. Qualitative error analysis reveals that unsupported or inflated symbolic readings typically emerge in ambiguous or metaphor-dense passages. The findings suggest that LLMs can provide coherent, well-grounded commentary but remain constrained in higher-order interpretation. The study proposes a structured benchmark for evaluating LLM reliability in narrative analysis and offers a methodological foundation for future work in computational humanities. All analyses in this study were generated using the ChatGPT-5.2 Large Language Model operating in a fully text-restricted environment.
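The two mechanical parts of such an evaluation, checking that quoted evidence actually occurs in the novel and averaging rubric scores per dimension, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the NFKC and whitespace normalization, and the exact-substring matching rule are all hypothetical choices, and the abstract does not specify how quotations were verified or how scores were aggregated.

```python
import unicodedata
from statistics import mean

# The five rubric dimensions named in the abstract.
RUBRIC = (
    "Textual Fidelity",
    "Accuracy",
    "Analytical Depth",
    "Coherence",
    "Linguistic Quality",
)


def normalize(text):
    """Apply NFKC normalization and collapse whitespace (an assumed matching rule)."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def unsupported_quotes(quotes, source_text):
    """Return the quotations that do not occur verbatim in the source text."""
    haystack = normalize(source_text)
    missing = []
    for quote in quotes:
        needle = normalize(quote)
        # An empty quotation is treated as unsupported rather than trivially found.
        if not needle or needle not in haystack:
            missing.append(quote)
    return missing


def dimension_means(ratings):
    """Average each rubric dimension over a list of per-response score dicts."""
    return {dim: round(mean(r[dim] for r in ratings), 2) for dim in RUBRIC}
```

A response would pass the quotation check only when `unsupported_quotes` returns an empty list for it. Exact matching is deliberately strict; a real pipeline for Arabic text might also need to decide how to treat diacritics and orthographic variants, which this sketch leaves untouched.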

Keywords

AI Evaluation, Arabic Linguistics, Computational Literary Studies, Digital Humanities, NLP.
