Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models

Jin Liu; Qingquan Li; Wenlong Du

TL;DR

This paper argues that current LLM evaluation is limited by static, knowledge-centric benchmarks that fail to reflect real-world problem solving and optimization needs. It proposes a three-stage Benchmarking-Evaluation-Assessment paradigm that moves evaluation from an 'examination room' to a 'hospital', using coarse benchmarking, targeted task-solving evaluation, and doctor-model attribution to diagnose root causes and provide optimization guidance. The framework emphasizes dynamic, scenario-driven assessment and fine-grained metrics to guide practical improvements and reduce data leakage risks. If adopted, the paradigm could yield more actionable insights for developers and practitioners, improving LLM reliability and applicability across domains.

Abstract

Current benchmarks for evaluating large language models (LLMs) suffer from restricted evaluation content, untimely updates, and a lack of optimization guidance. In this paper, we propose a new paradigm for measuring LLMs: Benchmarking-Evaluation-Assessment. Our paradigm shifts the "location" of LLM evaluation from the "examination room" to the "hospital". By conducting a "physical examination" of an LLM, it uses specific task-solving as the evaluation content, performs deep attribution of the problems found in the model, and provides recommendations for optimization.
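The three stages named in the abstract can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions, not the authors' implementation: every name in it (`benchmark`, `evaluate`, `assess`, `run_pipeline`, `Report`), the exact-match scoring, and the 0.8 threshold that decides which abilities receive targeted evaluation are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
Model = Callable[[str], str]                    # prompt -> answer
Suite = List[Tuple[str, str]]                   # (question, reference answer)
Task = Tuple[str, str, Callable[[str], bool]]   # (ability, prompt, checker)


@dataclass
class Report:
    benchmark_scores: Dict[str, float] = field(default_factory=dict)
    failed_tasks: List[str] = field(default_factory=list)
    recommendations: List[str] = field(default_factory=list)


def benchmark(model: Model, suites: Dict[str, Suite]) -> Dict[str, float]:
    """Stage 1 (coarse): exact-match accuracy per ability."""
    scores = {}
    for ability, items in suites.items():
        correct = sum(model(q) == a for q, a in items)
        scores[ability] = correct / len(items) if items else 0.0
    return scores


def evaluate(model: Model, tasks: Dict[str, Task], weak: set) -> List[str]:
    """Stage 2 (targeted): run concrete tasks for weak abilities only."""
    failed = []
    for name, (ability, prompt, check) in tasks.items():
        if ability in weak and not check(model(prompt)):
            failed.append(name)
    return failed


def assess(failed: List[str], doctor: Callable[[str], str]) -> List[str]:
    """Stage 3 (attribution): a 'doctor' maps each failure to advice."""
    return [doctor(name) for name in failed]


def run_pipeline(model, suites, tasks, doctor, threshold=0.8) -> Report:
    scores = benchmark(model, suites)
    # Assumption: abilities scoring below the threshold get targeted evaluation.
    weak = {a for a, s in scores.items() if s < threshold}
    failed = evaluate(model, tasks, weak)
    return Report(scores, failed, assess(failed, doctor))
```

In this sketch the doctor is any callable that turns a failed task name into a recommendation; in the paradigm described above that role would be played by a doctor model, whose interface the abstract does not specify.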

Paper Structure (7 sections, 2 figures)

This paper contains 7 sections, 2 figures.

Introduction
Potential Issues in Current Paradigm of Benchmarking LLMs
Limited Benchmarking Capability: Knowledge
Evaluation Datasets Lack Dynamic Updating
Inadequacy of Evaluation Metrics for Guiding Model Optimization
Benchmarking-Evaluation-Assessment: A New Paradigm for Measuring the Capability Level of LLMs
Conclusion

Figures (2)

Figure 1: A comparison of how human health and LLM ability are measured.
Figure 2: The architecture of our benchmarking-evaluation-assessment paradigm.
