
A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty, Jimmy Huang

TL;DR

This paper tackles the core problem of inconsistent and unreliable evaluation of large language models (LLMs) amid rapid methodological and dataset growth. It conducts a systematic survey of benchmarks, evaluation methodologies, and metrics, identifying pervasive issues in reproducibility, data contamination, parsing practices, and cross-benchmark generalizability. The authors offer perspectives and concrete recommendations to improve reproducibility, reliability, and robustness, including dynamic prompting, open-source evaluators, and hybrid automatic-human evaluation pipelines. The work aims to standardize evaluation practices, better align metrics with human judgments, and guide future research toward trustworthy, scalable LLM assessment with real-world relevance.

Abstract

Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.
