Novel Optimization Approach for Weighted Metric Evaluation in Question Answering Systems Using Genetic Algorithm and Grey Wolf Optimizer

Priyanka K; Toshima Jaiswal; Nandhini Kumaresh; Jayapriya J; Vinay M

doi:https://doi.org/10.14445/22315381/IJETT-V74I5P121

Research Article | Open Access

Volume 74 | Issue 5 | Year 2026 | Article Id. IJETT-V74I5P121 | DOI : https://doi.org/10.14445/22315381/IJETT-V74I5P121

Received	Revised	Accepted	Published
14 Jan 2026	06 Mar 2026	12 Mar 2026	30 May 2026

Citation :

Priyanka K, Toshima Jaiswal, Nandhini Kumaresh, Jayapriya J, Vinay M, "Novel Optimization Approach for Weighted Metric Evaluation in Question Answering Systems Using Genetic Algorithm and Grey Wolf Optimizer," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 5, pp. 312-328, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I5P121

Abstract

Automated Question Answering (QA) systems are essential building blocks of modern Natural Language Processing, powering applications such as virtual assistants, customer support bots, tutoring systems, and search engines. As QA systems become more widely used, precise and comprehensive evaluation becomes equally important. The Exact Match and F1-score metrics focus mainly on word-level overlap and do not account for semantic understanding, contextual consistency, or logical consistency. Moreover, existing QA evaluation schemes rely on fixed metrics or fixed combinations of metrics, which limits their flexibility across evaluation scenarios and their alignment with human judgments, even though the relative importance of each quality dimension can differ with the particular QA task or domain. To mitigate these limitations, this paper presents a new, task-adaptive evaluation protocol that blends five heterogeneous and complementary scoring metrics: BERTScore, BLEU, Entailment Score, Normalized Perplexity, and a Contrastive Penalty. Recognizing that different QA tasks may prioritize different aspects of answer quality, the approach learns a weight for each metric component instead of using fixed weights. The main contribution of this work is the application of two bio-inspired optimization algorithms to select these weights: a Genetic Algorithm, which is used to better match human-annotated answer quality with emphasis on the contrastive error penalty, and the Grey Wolf Optimizer, which optimizes a composite loss function that balances all metric components with lower computational overhead. The work also explores a hybrid view by studying and comparing the individual strengths of both optimization approaches under a common experimental setup.
Experiments conducted on a curated subset of the SQuAD v2.0 dataset, augmented with contrastive examples to simulate real-world vagueness, demonstrate that both approaches agree more closely with human judgments than traditional static metrics. The Genetic Algorithm is more sensitive to contrastive errors, while the Grey Wolf Optimizer yields more semantically coherent scoring and is more computationally efficient. Together, these approaches provide a general, adaptive framework for comprehensive QA evaluation that can be adapted to various application scenarios.
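
The weight-learning idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the composite loss, the metric scaling, or the optimizer settings, so everything below is an assumption. The sketch assumes that all five metric values are pre-scaled to [0, 1] with higher meaning better, that the composite score is a weighted sum with non-negative weights summing to 1, and that the loss is the mean squared error against human ratings. The function names, population size, iteration count, and the synthetic data are hypothetical. Only the position update follows the standard Grey Wolf Optimizer rule, in which each wolf moves toward the average of positions suggested by the three best wolves.

```python
import random

# Metric order is an assumption; each value is taken to be pre-scaled to [0, 1]
# with higher meaning better (e.g. penalty and perplexity already inverted).
METRICS = ["bertscore", "bleu", "entailment", "norm_perplexity", "contrastive_penalty"]


def normalize(position):
    """Map a raw search position to non-negative weights that sum to 1."""
    clipped = [max(p, 0.0) for p in position]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(clipped)] * len(clipped)
    return [c / total for c in clipped]


def composite_score(weights, metric_row):
    """Weighted sum of the per-answer metric values."""
    return sum(w * m for w, m in zip(weights, metric_row))


def mse_loss(position, rows, human):
    """Assumed loss: mean squared error between composite scores and human ratings."""
    w = normalize(position)
    return sum((composite_score(w, r) - h) ** 2 for r, h in zip(rows, human)) / len(rows)


def gwo_fit_weights(rows, human, n_wolves=12, n_iters=60, seed=0):
    """Search for metric weights with the standard Grey Wolf Optimizer update."""
    rng = random.Random(seed)
    dim = len(rows[0])
    fitness = lambda pos: mse_loss(pos, rows, human)
    # Seed one wolf with equal weights so the result is never worse than that baseline.
    wolves = [[1.0 / dim] * dim]
    wolves += [[rng.random() for _ in range(dim)] for _ in range(n_wolves - 1)]
    best, best_loss = wolves[0], fitness(wolves[0])
    for it in range(n_iters):
        # The three fittest wolves (alpha, beta, delta) guide the rest of the pack.
        alpha, beta, delta = sorted(wolves, key=fitness)[:3]
        if fitness(alpha) < best_loss:
            best, best_loss = alpha, fitness(alpha)
        a = 2.0 * (1.0 - it / n_iters)  # decreases from 2 toward 0
        moved = []
        for x in wolves:
            new_x = []
            for d in range(dim):
                estimate = 0.0
                for leader in (alpha, beta, delta):
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    estimate += leader[d] - A * abs(C * leader[d] - x[d])
                # Average the three suggestions and keep the position in [0, 1].
                new_x.append(min(max(estimate / 3.0, 0.0), 1.0))
            moved.append(new_x)
        wolves = moved
    final = min(wolves, key=fitness)
    if fitness(final) < best_loss:
        best, best_loss = final, fitness(final)
    return normalize(best), best_loss


# Synthetic illustration only: random metric values and "human" ratings
# generated from hidden weights. These are not results from the paper.
data_rng = random.Random(1)
hidden = [0.4, 0.1, 0.3, 0.1, 0.1]
rows = [[data_rng.random() for _ in METRICS] for _ in range(40)]
human = [composite_score(hidden, r) for r in rows]

baseline_loss = mse_loss([0.2] * 5, rows, human)  # static equal weights
weights, fitted_loss = gwo_fit_weights(rows, human)
print("equal-weight loss:", baseline_loss)
print("fitted loss:", fitted_loss)
print("fitted weights:", dict(zip(METRICS, (round(w, 3) for w in weights))))
```

Seeding the population with the equal-weight vector is a design choice of this sketch, so the fitted loss can be compared directly with the static baseline. A Genetic Algorithm variant could reuse the same `mse_loss` as its fitness function and replace the leader-guided update with selection, crossover, and mutation.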

Keywords

Genetic Algorithm (GA), Grey Wolf Optimizer (GWO), BERTScore, BLEU, Entailment Score, Normalized Perplexity, Contrastive Penalty.
