A Survey of Useful LLM Evaluation

Ji-Lun Peng, Sijia Cheng, Egil Diau, Yung-Yu Shih, Po-Heng Chen, Yen-Ting Lin, Yun-Nung Chen

TL;DR

This work presents a two-stage framework to evaluate the usefulness of large language models by separating core abilities (reasoning, safety/truthfulness, and domain knowledge) from agent capabilities (planning, tool use, and embodied tasks). It systematically surveys existing methods, benchmarks, and datasets across reasoning types, societal impact, the finance, legal, psychology, medicine, and education domains, and a broad set of agent scenarios (web grounding, code generation, API calls, robotics, etc.). Key contributions include organizing evaluation methods by stage, highlighting gaps (e.g., dynamic evaluation, root-cause analysis, fine-grained agent evaluation), and proposing directions toward more automated and robust benchmarks, including benchmarks for robotics. The paper aims to guide developers and researchers in building and assessing LLMs as trustworthy, capable tools with practical, real-world impact across domains and embodied tasks.

Abstract

LLMs have attracted attention across various research domains due to their exceptional performance on a wide range of complex tasks. Refined methods to evaluate their capabilities are therefore needed to determine the tasks and responsibilities they should undertake. Our study discusses how LLMs, as useful tools, should be effectively assessed. We propose a two-stage framework, from "core ability" to "agent", that explains how LLMs can be applied based on their specific capabilities, along with the evaluation methods for each stage. Core ability refers to the capabilities that LLMs need in order to generate high-quality natural language text. Once LLMs are confirmed to possess core ability, they can solve complex real-world tasks as agents. In the "core ability" stage, we discuss the reasoning ability, societal impact, and domain knowledge of LLMs. In the "agent" stage, we cover embodied action, planning, and tool learning in LLM agent applications. Finally, we examine the challenges currently confronting evaluation methods for LLMs, as well as directions for future development.

Paper Structure

This paper contains 46 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The two-stage framework of our LLMs evaluation.
  • Figure 2: The overview of core ability evaluation.
  • Figure 3: The overview of agent evaluation.