Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap

Jun Wang, Ninglun Gu, Kailai Zhang, Zijiao Zhang, Yelun Bao, Jin Yang, Xu Yin, Liwei Liu, Yihuan Liu, Pengyong Li, Gary G. Yen, Junchi Yan

TL;DR

The paper tackles the disconnect between benchmark scores and real-world utility in LLM evaluation by proposing an anthropomorphic framework that casts model capabilities through the lenses of $IQ$ (General Intelligence), $PQ$ (Professional Expertise), and $EQ$ (Alignment Ability), augmented by a Value Quotient ($VQ$) to assess societal impact. It introduces a modular, six-component LLM evaluation harness that organizes benchmarks, models, prompts, metrics, tasks, leaderboards/arenas, and analysis, and it analyzes 200+ benchmarks to identify gaps such as dynamic assessment and interpretability. Domain- and alignment-focused PQ and EQ benchmarks are cataloged across healthcare, finance, law, telecom, coding, software, and science, while RAG, agent, and chatbot applications are examined under a system-level lens. The work offers actionable guidance, a roadmap for future evaluation practices, and a public repository of open-source resources to drive development of LLMs that are technically proficient, contextually relevant, and ethically responsible, with broad societal benefits.

Abstract

For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational capacity, Emotional Quotient (EQ)-Alignment Ability for value-based interactions, and Professional Quotient (PQ)-Professional Expertise for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of 200+ benchmarks, we identify key challenges including dynamic assessment needs and interpretability gaps. The survey provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound. We maintain a curated repository of open-source evaluation resources at: https://github.com/onejune2018/Awesome-LLM-Eval.

Paper Structure

This paper contains 54 sections, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Overview of contents of this paper (zoom in).
  • Figure 2: The proposed technical evolutionary tree of LLM evaluation, following the structure used for RAG by Gao et al. (2024). The anthropomorphic evaluation framework: IQ-EQ-PQ taxonomy with evolutionary correspondence to LLM training stages. Intelligence Quotient (IQ)-General Intelligence denotes knowledge capacity acquired by pre-training, reflecting foundational reasoning and world knowledge breadth. Professional Quotient (PQ)-Professional Expertise represents task capability developed through supervised fine-tuning (SFT), measuring proficiency in specialized domains. Emotional Quotient (EQ)-Alignment Ability represents human preference alignment achieved through RL post-training, encompassing emotional and ethical resonance with human values.
  • Figure 3: Typology of the LLM Evaluation Modules.
  • Figure 4: Value-oriented Evaluation for LLMs.