Research Article | Open Access
Volume 74 | Issue 5 | Year 2026 | Article Id. IJETT-V74I5P121 | DOI: https://doi.org/10.14445/22315381/IJETT-V74I5P121
Novel Optimization Approach for Weighted Metric Evaluation in Question Answering Systems Using Genetic Algorithm and Grey Wolf Optimizer
Priyanka K, Toshima Jaiswal, Nandhini Kumaresh, Jayapriya J, Vinay M
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 14 Jan 2026 | 06 Mar 2026 | 12 Mar 2026 | 30 May 2026 |
Citation:
Priyanka K, Toshima Jaiswal, Nandhini Kumaresh, Jayapriya J, Vinay M, "Novel Optimization Approach for Weighted Metric Evaluation in Question Answering Systems Using Genetic Algorithm and Grey Wolf Optimizer," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 5, pp. 312-328, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I5P121
Abstract
Automated Question Answering (QA) systems are essential building blocks of modern Natural Language Processing, powering applications such as virtual assistants, customer support bots, tutoring systems, and search engines. As QA systems become more widely used, precise and comprehensive evaluation becomes correspondingly important. The Exact Match and F1-score metrics focus mainly on word-level similarity and do not account for semantic understanding, contextual consistency, or logical consistency. Moreover, existing QA evaluation schemes rely on fixed metrics or fixed combinations of metrics, which limits their flexibility across evaluation scenarios and their alignment with human judgments, even though the relative importance of these features can differ with the particular question-answering task or domain. To mitigate these limitations, this paper presents a new, task-adaptive evaluation protocol that blends five heterogeneous and complementary scoring metrics: BERTScore, BLEU, Entailment Score, Normalized Perplexity, and a Contrastive Penalty. Because different QA tasks may prioritize different aspects of answer quality, the approach learns an optimal weight distribution over the metric components instead of using fixed weights. The contribution of this work is the application of two bio-inspired optimization algorithms to select these weights: a Genetic Algorithm, which is used to better capture human-annotated answer quality with emphasis on the contrastive error penalty, and the Grey Wolf Optimizer, which optimizes a composite loss function that balances all the metric components with lower computational overhead. The work also explores a hybrid view by studying and comparing the individual strengths of both optimization approaches under a typical experimental setup.
Experiments conducted on a curated subset of the SQuAD v2.0 dataset, augmented with contrastive examples to simulate real-world vagueness, demonstrate that both approaches agree with human judgments better than traditional static metrics do. The Genetic Algorithm is contrast-sensitive, while the Grey Wolf Optimizer is semantically coherent and computationally efficient. Together, these approaches provide a general, adaptive framework for comprehensive QA evaluation that could be adapted to various application scenarios.
Keywords
Genetic Algorithm (GA), Grey Wolf Optimizer (GWO), BERTScore, BLEU, Entailment Score, Normalized Perplexity, Contrastive Penalty.
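The weighting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mean-squared-error loss, the choice to subtract the Contrastive Penalty, the non-negative weights that sum to one, the function names, and the synthetic "human" ratings are all assumptions made for illustration. The update rule follows the standard Grey Wolf Optimizer scheme, in which each candidate moves toward the three best solutions found in the current pack.

```python
import random

# Assumed metric order: BERTScore, BLEU, Entailment Score,
# Normalized Perplexity, Contrastive Penalty (subtracted; an assumption).
SIGNS = [1.0, 1.0, 1.0, 1.0, -1.0]

def composite_score(row, weights):
    """Signed weighted sum of the five metric values for one answer."""
    return sum(s * w * x for s, w, x in zip(SIGNS, weights, row))

def loss(weights, rows, human):
    """Mean squared error against human ratings (hypothetical objective)."""
    errors = [(composite_score(r, weights) - h) ** 2 for r, h in zip(rows, human)]
    return sum(errors) / len(rows)

def normalize(w):
    """Clip weights to be non-negative and rescale them to sum to one."""
    w = [max(x, 0.0) for x in w]
    total = sum(w)
    return [x / total for x in w] if total > 0 else [1.0 / len(w)] * len(w)

def gwo_weights(rows, human, n_wolves=20, n_iter=100, seed=0):
    """Search for metric weights with a basic Grey Wolf Optimizer."""
    rng = random.Random(seed)
    dim = len(SIGNS)
    # Seed the pack with uniform weights so the result is never worse than them.
    wolves = [[1.0 / dim] * dim]
    wolves += [normalize([rng.random() for _ in range(dim)]) for _ in range(n_wolves - 1)]
    best_w, best_loss = None, float("inf")
    for t in range(n_iter + 1):
        ranked = sorted(wolves, key=lambda w: loss(w, rows, human))
        top_loss = loss(ranked[0], rows, human)
        if top_loss < best_loss:  # keep the best solution seen so far
            best_w, best_loss = list(ranked[0]), top_loss
        if t == n_iter:
            break
        leaders = ranked[:3]           # alpha, beta, delta wolves
        a = 2.0 * (1.0 - t / n_iter)   # decreases linearly from 2 to 0
        new_wolves = []
        for w in wolves:
            pos = [0.0] * dim
            for leader in leaders:
                for j in range(dim):
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    pos[j] += (leader[j] - A * abs(C * leader[j] - w[j])) / 3.0
            new_wolves.append(normalize(pos))
        wolves = new_wolves
    return best_w, best_loss

# Synthetic demonstration data; real use would need actual metric values
# and human annotations.
data_rng = random.Random(1)
true_w = [0.4, 0.1, 0.3, 0.1, 0.1]
rows = [[data_rng.random() for _ in range(5)] for _ in range(100)]
human = [composite_score(r, true_w) for r in rows]
weights, err = gwo_weights(rows, human)
print([round(w, 3) for w in weights], err)
```

A Genetic Algorithm variant would keep the same `loss` and `normalize` helpers and replace the leader-guided update with selection, crossover, and mutation over the weight vectors.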