LIMR: Less is More for RL Scaling

Xuefeng Li, Haoyang Zou, Pengfei Liu

TL;DR

The paper asks whether simply increasing RL training data yields better reasoning in language models, and argues that data quality and each sample's alignment with the model's learning dynamics matter more. It introduces Learning Impact Measurement (LIM), which scores each sample by how well its learning trajectory aligns with the model's overall learning progress, and uses this score to curate LIMR, a subset of 1,389 samples drawn from 8,523. Empirically, LIMR matches or surpasses full-data RL performance on AIME24, MATH500, and AMC2023 despite using far less data, and it outperforms data-efficient supervised fine-tuning approaches on small models. The work offers a practical, reproducible approach to data-efficient RL in LLMs and suggests that targeted sample selection can substantially reduce compute while preserving, and sometimes improving, reasoning capabilities.

Abstract

In this paper, we ask: what truly determines the effectiveness of RL training data for enhancing language models' reasoning capabilities? While recent advances like o1, Deepseek R1, and Kimi1.5 demonstrate RL's potential, the lack of transparency about training data requirements has hindered systematic progress. Starting directly from base models without distillation, we challenge the assumption that scaling up RL training data inherently improves performance. We demonstrate that a strategically selected subset of just 1,389 samples can outperform the full 8,523-sample dataset. We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples based on their alignment with model learning trajectories, enabling efficient resource utilization and scalable implementation. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523-sample dataset. Notably, while recent data-efficient approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find that they significantly underperform at 7B-scale through supervised fine-tuning (SFT). In contrast, our RL-based LIMR achieves 16.7% higher accuracy on AIME24 and outperforms LIMO and s1 by 13.0% and 22.2% on MATH500. These results fundamentally reshape our understanding of RL scaling in LLMs, demonstrating that precise sample selection, rather than data scale, may be the key to unlocking enhanced reasoning capabilities. For reproducible research and future innovation, we are open-sourcing LIMR, including the implementation of LIM, training and evaluation code, curated datasets, and trained models at https://github.com/GAIR-NLP/LIMR.

Paper Structure (17 sections, 3 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: (a) Accuracy on AIME24 when training with different datasets in RL, without any data distillation or SFT cold start. Our curated LIMR dataset, a strategically selected subset of the full dataset, MATH (level 3-5), achieved comparable accuracy while using less than one-sixth of the data. Notably, LIMR significantly outperformed a randomly selected subset of equivalent size, demonstrating the effectiveness of our selective dataset construction methodology. (b) A comparison of different data-efficient models. Directly applying SFT on the LIMO (Ye et al., 2025) and s1 (Muennighoff et al., 2025) datasets with Qwen-Math-7B yields significantly inferior results compared to using RL with LIMR, implying that, for small models, RL is more effective at achieving data efficiency.
  • Figure 2: (a) Learning dynamics analysis of training samples from the MATH-FULL dataset across epochs. Solution reward trajectories reveal diverse patterns: samples that maintain near-zero rewards, samples that quickly achieve high rewards, and samples that show dynamic learning progress with varying improvement rates. (b) Sample learning trajectories compared against the average reward curve (red). Higher LIM scores reflect better alignment with the model's learning trajectory: trajectories with growth patterns similar to the average curve receive higher scores.
  • Figure 3: Performance and training dynamics
  • Figure 4: Accuracy on various benchmarks
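
The selection idea described in the Figure 2(b) caption can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the scoring rule (one minus a normalized squared distance to the average reward curve), the function names, the toy reward values, and the 0.6 threshold are all assumptions made for the example. The paper's own definition is given in its equations and in the linked repository.

```python
from typing import Dict, List


def average_curve(trajectories: Dict[str, List[float]]) -> List[float]:
    """Mean reward over all samples at each epoch (the reference curve)."""
    n_epochs = len(next(iter(trajectories.values())))
    n_samples = len(trajectories)
    return [
        sum(t[k] for t in trajectories.values()) / n_samples
        for k in range(n_epochs)
    ]


def alignment_score(trajectory: List[float], avg: List[float]) -> float:
    """Assumed score: 1 - (squared distance to avg curve) / (normalizer).

    The normalizer is the squared distance between the average curve and a
    perfect reward of 1.0 at every epoch. This form is an assumption.
    """
    num = sum((r - a) ** 2 for r, a in zip(trajectory, avg))
    den = sum((1.0 - a) ** 2 for a in avg)
    if den == 0:
        # Average curve is already perfect; no meaningful signal to align with.
        return 0.0
    return 1.0 - num / den


def select_samples(
    trajectories: Dict[str, List[float]], threshold: float
) -> List[str]:
    """Keep the ids of samples whose score reaches the threshold."""
    avg = average_curve(trajectories)
    return [
        sample_id
        for sample_id, traj in trajectories.items()
        if alignment_score(traj, avg) >= threshold
    ]


# Toy per-epoch rewards mirroring the three patterns named in Figure 2(a).
toy_trajectories = {
    "never_solved": [0.0, 0.0, 0.0, 0.0],
    "always_solved": [1.0, 1.0, 1.0, 1.0],
    "steady_learner": [0.2, 0.4, 0.6, 0.8],
}

# Hypothetical threshold chosen only for this example.
selected = select_samples(toy_trajectories, threshold=0.6)
print(selected)
```

In this toy setup, the samples that stay at zero reward or start at full reward sit far from the average curve and are filtered out, while the sample that improves alongside the average is kept.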