Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap

Jun Wang; Ninglun Gu; Kailai Zhang; Zijiao Zhang; Yelun Bao; Jin Yang; Xu Yin; Liwei Liu; Yihuan Liu; Pengyong Li; Gary G. Yen; Junchi Yan

TL;DR

The paper addresses the disconnect between benchmark scores and real-world utility in LLM evaluation. It proposes an anthropomorphic framework that views model capabilities through three lenses: IQ (General Intelligence), PQ (Professional Expertise), and EQ (Alignment Ability), augmented by a Value Quotient (VQ) that assesses societal impact. It introduces a modular LLM evaluation harness that organizes benchmarks, models, prompts, metrics, tasks, leaderboards/arenas, and analysis, and it analyzes 200+ benchmarks to identify gaps in areas such as dynamic assessment and interpretability. PQ (domain-focused) and EQ (alignment-focused) benchmarks are cataloged across healthcare, finance, law, telecom, coding, software, and science, while RAG, agent, and chatbot applications are examined at the system level. The work offers actionable guidance, a roadmap for future evaluation practice, and a public repository of open-source resources, with the aim of supporting LLMs that are technically proficient, contextually relevant, and ethically responsible.

Abstract

For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting the holistic assessment needed for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ), or General Intelligence, for foundational capacity; Emotional Quotient (EQ), or Alignment Ability, for value-based interactions; and Professional Quotient (PQ), or Professional Expertise, for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of 200+ benchmarks, we identify key challenges, including dynamic assessment needs and interpretability gaps. The survey provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound. We maintain a curated repository of open-source evaluation resources at: https://github.com/onejune2018/Awesome-LLM-Eval.
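As an illustration of how results might be organized under the taxonomy described above, the following is a minimal sketch, not the authors' implementation. The `Scorecard` class, the benchmark names, the normalization of scores to [0, 1], and the use of an unweighted per-dimension mean are all assumptions made for this example; the abstract does not specify an aggregation rule.

```python
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean


class Dimension(Enum):
    """The four evaluation dimensions named in the abstract."""
    IQ = "General Intelligence"
    EQ = "Alignment Ability"
    PQ = "Professional Expertise"
    VQ = "Value-oriented Evaluation"


@dataclass
class Scorecard:
    """Hypothetical container: benchmark scores in [0, 1], grouped by dimension."""
    scores: dict = field(default_factory=lambda: {d: {} for d in Dimension})

    def add(self, dimension, benchmark, score):
        # Assumption: every benchmark score has been normalized to [0, 1].
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must be normalized to [0, 1]")
        self.scores[dimension][benchmark] = score

    def profile(self):
        """Unweighted mean per dimension; dimensions with no results map to None."""
        return {
            d.name: (mean(results.values()) if results else None)
            for d, results in self.scores.items()
        }


# Benchmark names below are placeholders, not benchmarks from the survey.
card = Scorecard()
card.add(Dimension.IQ, "reasoning_suite", 0.5)
card.add(Dimension.IQ, "knowledge_suite", 1.0)
card.add(Dimension.PQ, "medical_qa", 0.25)
print(card.profile())
```

Reporting a per-dimension profile, rather than collapsing everything into one number, keeps a strong IQ score from masking a missing or weak EQ, PQ, or VQ assessment. That is one way to read the abstract's call for holistic evaluation, and it is a choice made for this sketch.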
