Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models

Jin Liu, Qingquan Li, Wenlong Du

TL;DR

This paper argues that current LLM evaluation is limited by static, knowledge-centric benchmarks that fail to reflect real-world problem solving and optimization needs. It proposes a three-stage Benchmarking-Evaluation-Assessment paradigm that moves evaluation from an 'examination room' to a 'hospital', using coarse benchmarking, targeted task-solving evaluation, and doctor-model attribution to diagnose root causes and provide optimization guidance. The framework emphasizes dynamic, scenario-driven assessment and fine-grained metrics to guide practical improvements and reduce data leakage risks. If adopted, the paradigm could yield more actionable insights for developers and practitioners, improving LLM reliability and applicability across domains.

Abstract

Current benchmarks for evaluating large language models (LLMs) suffer from restricted evaluation content, untimely updates, and a lack of optimization guidance. In this paper, we propose a new paradigm for the measurement of LLMs: Benchmarking-Evaluation-Assessment. Our paradigm shifts the "location" of LLM evaluation from the "examination room" to the "hospital". By conducting a "physical examination" on LLMs, it uses specific task-solving as the evaluation content, performs deep attribution of existing problems within LLMs, and provides recommendations for optimization.

Paper Structure (7 sections, 2 figures)

This paper contains 7 sections, 2 figures.

Figures (2)

  • Figure 1: The comparison of measurement on human health and LLM's ability.
  • Figure 2: The architecture of our benchmarking-evaluation-assessment paradigm.