Research Article | Open Access
Volume 74 | Issue 5 | Year 2026 | Article Id. IJETT-V74I5P122 | DOI : https://doi.org/10.14445/22315381/IJETT-V74I5P122
Evaluating the Reliability of Large Language Models in Literary Analysis of Arabic Novels: A Structured Benchmark Using Grounded Textual Evidence
Emad A. Aldomour, Ameen Shaheen
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 17 Jan 2026 | 19 Mar 2026 | 28 Mar 2026 | 30 May 2026 |
Citation :
Emad A. Aldomour, Ameen Shaheen, "Evaluating the Reliability of Large Language Models in Literary Analysis of Arabic Novels: A Structured Benchmark Using Grounded Textual Evidence," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 5, pp. 329-341, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I5P122
Abstract
This study examines the reliability of a contemporary Large Language Model (LLM) in performing strictly text-bound literary analysis of the Arabic novel "الرصاصة الصديقة". Despite the growing use of LLMs in the humanities, their interpretive behavior in Arabic narrative contexts remains underexplored. Using the novel as a controlled corpus, the model was barred from drawing on external knowledge and evaluated under a Grounded-Evidence Protocol requiring all claims to be supported by explicit textual quotations. Output quality was assessed through a five-dimensional rubric measuring Textual Fidelity, Accuracy, Analytical Depth, Coherence, and Linguistic Quality. Quantitative results show strong performance in Coherence (4.50), Linguistic Quality (4.58), and Textual Fidelity (4.30), while Analytical Depth was moderate (3.95), indicating limitations in symbolic reasoning and culturally embedded interpretation. Qualitative error analysis reveals that unsupported or inflated symbolic readings typically emerge in ambiguous or metaphor-dense passages. The findings suggest that LLMs can provide coherent, well-grounded commentary but remain constrained in higher-order interpretation. The study proposes a structured benchmark for evaluating LLM reliability in narrative analysis and offers a methodological foundation for future work in computational humanities. All analyses in this study were generated using the ChatGPT-5.2 Large Language Model operating in a fully text-restricted environment.
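As an illustration only, the two mechanisms named in the abstract (a check that every claim carries quotations found in the text, and per-dimension averaging of rubric scores) might be sketched as follows. This is not the authors' implementation: the `Claim` type, the function names, the whitespace-only normalization, and the rule that a claim with no quotation counts as ungrounded are all assumptions made for this sketch. Real Arabic text would likely also need diacritic and letter-variant normalization, which is omitted here.

```python
from dataclasses import dataclass
from statistics import mean

# The five rubric dimensions named in the abstract (identifier spellings are hypothetical).
RUBRIC = (
    "textual_fidelity",
    "accuracy",
    "analytical_depth",
    "coherence",
    "linguistic_quality",
)


@dataclass
class Claim:
    """One analytical claim plus the quotations offered as evidence (hypothetical type)."""
    text: str
    quotes: list[str]


def normalize(s: str) -> str:
    # Collapse runs of whitespace so line breaks in the corpus do not break matching.
    return " ".join(s.split())


def is_grounded(claim: Claim, corpus: str) -> bool:
    # Assumed rule: a claim is grounded only if it has at least one quotation
    # and every quotation occurs verbatim in the (normalized) corpus.
    haystack = normalize(corpus)
    return bool(claim.quotes) and all(normalize(q) in haystack for q in claim.quotes)


def dimension_means(ratings: list[dict[str, float]]) -> dict[str, float]:
    # Average each rubric dimension over all rated outputs, to two decimals.
    return {dim: round(mean(r[dim] for r in ratings), 2) for dim in RUBRIC}
```

Under these assumptions, a claim whose quotation does not appear in the novel would be flagged before any rubric scoring, which keeps the Textual Fidelity judgment separate from the interpretive dimensions.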
Keywords
AI Evaluation, Arabic Linguistics, Computational Literary Studies, Digital Humanities, NLP.