EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving

Shihan Dou, Ming Zhang, Chenhao Huang, Jiayi Chen, Feng Chen, Shichun Liu, Yan Liu, Chenxiao Liu, Cheng Zhong, Zongzhang Zhang, Tao Gui, Chao Xin, Chengzhi Wei, Lin Yan, Yonghui Wu, Qi Zhang, Xuanjing Huang

TL;DR

EvaLearn is a benchmark that evaluates large language models on their learning capability and efficiency in challenging tasks. Its results show that current LLMs with stronger static abilities do not have a clear advantage in learning capability across all tasks, which suggests that EvaLearn measures a new dimension of model performance.

Abstract

We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical, yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each sequence dedicated to one task type. Diverging from most existing benchmarks that evaluate models in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage the experience gained from previous solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while some models struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate model learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel evaluation perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automatic evaluation framework, and the results studied in this paper are available at the GitHub repository.

Paper Structure

This paper contains 29 sections, 8 equations, 24 figures, 5 tables.

Figures (24)

  • Figure 1: Overview of EvaLearn. Instead of parallel evaluation, EvaLearn requires models to solve problems sequentially, thereby systematically evaluating the LLM's learning ability and efficiency.
  • Figure 2: (Left) Average number of consecutive correct solutions ($N_{\mathrm{consec}}$). (Right) Average position of the first correct solution ($P_{\mathrm{first}}$). Results are shown for OpenAI-o3-mini, with each node representing a sequence.
  • Figure 3: (Left) Post-warmup accuracy ($\mathrm{Acc}_{\mathrm{pw}}\text{-K}$) results of Claude-3.7-Sonnet-Thinking. (Right) Position-wise accuracy curve and its slope $k$ of DeepSeek-R1.
  • Figure 4: The case used in Case Study 1 is designed for comparison between Parallel Solving (Zero-shot) and Sequence Solving (Feedback Learning).
  • Figure 5: The case used in Case Study 2 provides evidence of learning from prior experience.
  • ...and 19 more figures
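
The captions above name several sequence-level metrics without defining them here. As a rough illustration only, the sketch below computes plausible versions of them from per-problem correctness outcomes. Every function name and every definition in it is an assumption made for this example: the first-correct position is taken as 1-indexed, $N_{\mathrm{consec}}$ is read as the longest run of consecutive correct solutions, post-warmup accuracy is read as accuracy after skipping the first K problems, and the slope $k$ is read as an ordinary least-squares fit over position-wise accuracy. The paper's exact definitions may differ.

```python
from typing import List, Optional, Sequence


def first_correct_position(outcomes: Sequence[bool]) -> Optional[int]:
    """1-indexed position of the first correct solution, or None if none is correct."""
    for position, correct in enumerate(outcomes, start=1):
        if correct:
            return position
    return None


def longest_correct_run(outcomes: Sequence[bool]) -> int:
    """Length of the longest run of consecutive correct solutions (assumed reading)."""
    best = current = 0
    for correct in outcomes:
        current = current + 1 if correct else 0
        best = max(best, current)
    return best


def post_warmup_accuracy(outcomes: Sequence[bool], k: int) -> float:
    """Accuracy over the problems that follow the first k (assumed warmup) problems."""
    tail = outcomes[k:]
    if not tail:
        raise ValueError("warmup length k must be smaller than the sequence length")
    return sum(tail) / len(tail)


def position_wise_accuracy(sequences: Sequence[Sequence[bool]]) -> List[float]:
    """Fraction of sequences solved correctly at each position (equal-length sequences)."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    return [sum(seq[p] for seq in sequences) / len(sequences) for p in range(length)]


def fitted_slope(accuracies: Sequence[float]) -> float:
    """Least-squares slope of accuracy against 1-indexed position."""
    n = len(accuracies)
    if n < 2:
        raise ValueError("at least two positions are required to fit a slope")
    positions = range(1, n + 1)
    mean_x = sum(positions) / n
    mean_y = sum(accuracies) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(positions, accuracies))
    denominator = sum((x - mean_x) ** 2 for x in positions)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical outcomes for three sequences of three problems each.
    runs = [
        [False, True, True],
        [False, False, True],
        [True, True, True],
    ]
    curve = position_wise_accuracy(runs)
    print("position-wise accuracy:", curve)
    print("slope:", fitted_slope(curve))
    print("first correct positions:", [first_correct_position(r) for r in runs])
```

Under these assumed definitions, a positive slope means accuracy tends to rise with position in the sequence, which is the kind of signal a learning-oriented benchmark looks for, while a negative slope would be consistent with the negative transfer mentioned in the abstract.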