Revisiting Generalization Across Difficulty Levels: It's Not So Easy

Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana Nguyen, Stephen H. Bach

TL;DR

This paper examines whether large language models (LLMs) generalize across task difficulties. It applies Item Response Theory (IRT) to estimate per-example difficulty from the outputs of thousands of models on the Open LLM Leaderboard across six benchmarks, then trains and evaluates instruction-tuned models on fine-grained difficulty bins. The key finding is that cross-difficulty generalization is limited and diminishes as the gap between train and test difficulty grows, challenging the idea that easy or hard data alone can generalize broadly. The paper argues for difficulty-aware data curation and evaluation to ensure robust performance across the full spectrum of tasks.

Abstract

We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.
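To make the difficulty-ranking step concrete, the following is a minimal sketch of how IRT can turn a binary correctness matrix into per-example difficulty scores and difficulty bins. It assumes the simplest IRT variant, the one-parameter (Rasch) model, fitted by plain gradient ascent; the abstract does not say which IRT variant or fitting procedure the authors use, and the names `fit_rasch` and `assign_bins` and the toy response matrix are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_rasch(responses, steps=500, lr=0.1):
    """Fit a Rasch (1PL) model: P(correct) = sigmoid(ability_i - difficulty_j).

    responses[i][j] is 1 if model i answered item j correctly, else 0.
    Assumption: plain gradient ascent on the log-likelihood, not the
    authors' actual fitting procedure.
    """
    n_models, n_items = len(responses), len(responses[0])
    ability = [0.0] * n_models
    difficulty = [0.0] * n_items
    for _ in range(steps):
        grad_a = [0.0] * n_models
        grad_d = [0.0] * n_items
        for i in range(n_models):
            for j in range(n_items):
                err = responses[i][j] - sigmoid(ability[i] - difficulty[j])
                grad_a[i] += err   # d(log-lik)/d(ability_i)
                grad_d[j] -= err   # d(log-lik)/d(difficulty_j)
        ability = [a + lr * g / n_items for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g / n_models for d, g in zip(difficulty, grad_d)]
        # The model is only identified up to a shift, so center difficulties at 0.
        shift = sum(difficulty) / n_items
        difficulty = [d - shift for d in difficulty]
        ability = [a - shift for a in ability]
    return ability, difficulty

def assign_bins(difficulty, n_bins):
    """Split items into n_bins equal-sized groups, from easiest (0) to hardest."""
    order = sorted(range(len(difficulty)), key=lambda j: difficulty[j])
    bins = [0] * len(difficulty)
    for rank, j in enumerate(order):
        bins[j] = rank * n_bins // len(difficulty)
    return bins

# Toy data (hypothetical): 5 models x 4 items; items are solved by 4, 3, 2, 1 models.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
ability, difficulty = fit_rasch(responses)
bins = assign_bins(difficulty, n_bins=2)
```

In this sketch an item's difficulty depends only on how the population of models performs on it, which mirrors the abstract's point that the ratings come from model behavior rather than human judgments of difficulty.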

Paper Structure

This paper contains 41 sections, 1 equation, 19 figures, 3 tables.

Figures (19)

  • Figure 1: Comparison of human-defined and IRT difficulty estimates for three datasets. Each dot represents one question. Left: MATH question distribution by number of reasoning steps (Hendrycks et al., 2021). Top right: MMLU-Pro question distribution by grade level (Wang et al., 2024), with questions lacking assigned grades grouped as "Other Disciplines". Bottom right: ARC question distribution by grade level (Clark et al., 2018). All distributions are shown across IRT difficulty score bins.
  • Figure 2: Heatmaps showing Spearman correlations between IRT difficulty scores and human metrics. Colors indicate correlation strength from negative (red) to positive (blue). ARC shows weak positive correlations across all metrics, while MMLU-Pro demonstrates mostly no or negative correlation between IRT difficulty and common human metrics for difficulty.
  • Figure 3: Cross-difficulty generalization heatmaps for Qwen2.5 14B Instruct on the MMLU-Pro dataset. Left: Performance when training on a difficulty bin (y-axis) and testing on another difficulty bin (x-axis). Right: Improvement from finetuning on each bin compared to the zero-shot performance of the model on that bin. Diagonal elements are masked as they represent the same train and test data.
  • Figure 4: Improvement analysis for Qwen2.5 14B Instruct showing the difference between SFT and zero-shot performance. Blue indicates positive improvements (SFT better than zero-shot), red indicates negative improvements (SFT worse than zero-shot).
  • Figure 5: Zero-shot performance of Qwen 3 4B Instruct 2507 and Qwen 3 30B-A3B Instruct 2507 on the same benchmarks we evaluate against, divided by IRT difficulty bins. These models exhibit lower performance on more difficult bins, even though the IRT difficulty scores were not calibrated using these models' responses.
  • ...and 14 more figures
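
The improvement heatmaps described in Figures 3 and 4 can be sketched as a simple matrix computation. This is an illustrative reading of the captions, not the authors' code: it assumes the baseline for each cell is the zero-shot accuracy on the test bin, and the function name `improvement_matrix` and the accuracy values are hypothetical.

```python
def improvement_matrix(sft_acc, zero_shot_acc):
    """Improvement of SFT over zero-shot for every (train bin, test bin) pair.

    sft_acc[i][j]: accuracy after finetuning on bin i, evaluated on bin j.
    zero_shot_acc[j]: zero-shot accuracy on bin j (assumed baseline).
    Diagonal cells are masked (None) because train and test data coincide.
    """
    n = len(zero_shot_acc)
    return [
        [None if i == j else sft_acc[i][j] - zero_shot_acc[j] for j in range(n)]
        for i in range(n)
    ]

# Hypothetical accuracies for two difficulty bins (0 = easy, 1 = hard).
sft_acc = [
    [0.0, 0.5],    # trained on bin 0; the diagonal entry is ignored
    [0.75, 0.0],   # trained on bin 1; the diagonal entry is ignored
]
zero_shot_acc = [0.5, 0.25]
improvement = improvement_matrix(sft_acc, zero_shot_acc)
```

Positive entries correspond to the blue cells in Figure 4 (SFT better than zero-shot) and negative entries to the red cells.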