Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu

TL;DR

EVOL-RL addresses entropy collapse in label-free LLM evolution by coupling a majority-voted selection anchor with a novelty-driven variation signal, optimized via GRPO. By maintaining stability while encouraging diverse reasoning, it prevents loss of exploration and enhances both in-domain and out-of-domain generalization, achieving substantial gains on math benchmarks and beyond. Ablation studies confirm the necessity of novelty, entropy regularization, and asymmetric clipping, and results extend to supervised GRPO settings, illustrating broad applicability and practical impact for autonomous model improvement without labels.

Abstract

Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing self-improvement approaches primarily rely on self-confirmation signals (e.g., confidence, entropy, or consistency) to generate rewards. This reliance drives models toward over-confident, majority-favored solutions, causing an entropy collapse that degrades pass@n and reasoning complexity. To address this, we propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation. Concretely, EVOL-RL retains the majority-voted answer as an anchor for stability, but adds a novelty-aware reward that scores each sampled solution by how different its reasoning is from other concurrently generated responses. This majority-for-stability + novelty-for-exploration rule mirrors the variation-selection principle: selection prevents drift, while novelty prevents collapse. Evaluation results show that EVOL-RL consistently outperforms the majority-only baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from baseline's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents in-domain diversity collapse but also improves out-of-domain generalization (from math reasoning to broader tasks, e.g., GPQA, MMLU-Pro, and BBEH). The code is available at: https://github.com/YujunZhou/EVOL-RL.
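The majority-plus-novelty rule described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`novelty_scores`, `evol_style_rewards`), the bag-of-words cosine similarity used as a stand-in for semantic similarity, and the specific reward bands ([0.5, 1] for majority answers, [-1, -0.5] for minority answers) are all hypothetical choices made for this example.

```python
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def novelty_scores(reasonings):
    """Novelty = 1 - mean similarity to the other sampled responses.

    ASSUMPTION: bag-of-words vectors stand in for a semantic embedding.
    """
    vecs = [Counter(r.lower().split()) for r in reasonings]
    scores = []
    for i, v in enumerate(vecs):
        sims = [cosine(v, w) for j, w in enumerate(vecs) if j != i]
        mean_sim = sum(sims) / len(sims) if sims else 0.0
        scores.append(1.0 - mean_sim)
    return scores


def evol_style_rewards(answers, reasonings):
    """Combine a majority-vote anchor with a novelty bonus.

    ASSUMPTION: the reward bands below are illustrative only.
    Majority answers land in [0.5, 1.0], minority answers in [-1.0, -0.5];
    within each band, more novel reasoning scores higher.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    nov = novelty_scores(reasonings)
    lo, hi = min(nov), max(nov)
    span = hi - lo
    rewards = []
    for ans, n in zip(answers, nov):
        u = (n - lo) / span if span > 0 else 0.5  # novelty rescaled to [0, 1]
        base = 0.5 if ans == majority else -1.0
        rewards.append(base + 0.5 * u)
    return majority, rewards


if __name__ == "__main__":
    answers = ["4", "4", "4", "5"]
    reasonings = [
        "add two and two",
        "add two and two",
        "count on fingers from one up",
        "guess five",
    ]
    print(evol_style_rewards(answers, reasonings))
```

Keeping the two bands disjoint means the majority signal always dominates (selection), while novelty only reorders responses that share the same majority status (variation). In a GRPO-style update these per-response rewards would then be normalized within the group to form advantages.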

Paper Structure

This paper contains 32 sections, 6 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: TTRL's entropy collapse vs. EVOL-RL's diversity preservation on Qwen3-4B-Base (trained label-free on MATH-500). Majority-only TTRL drives pass@$n$ down for $n>1$, shortens reasoning, and collapses entropy, whereas EVOL-RL improves accuracy and sustains reasoning diversity.
  • Figure 2: An overview of the Evol-RL framework. For each prompt, the policy generates multiple responses. These are grouped by their final answer to identify the majority group. A novelty score is then computed for each response based on its semantic dissimilarity to others. Finally, a reward is assigned based on both majority (selection) and novelty (variation), guiding the policy update via GRPO. In the illustration, colors group responses by their final answer, while different marker shapes indicate semantically distinct reasoning paths.
  • Figure 3: Training dynamics for Evol-RL and TTRL. Left: models trained on MATH-TRAIN. Middle: models trained on MATH-500. Right: models trained on AIME24. Each panel plots, over training steps, (i) Pass@1 on AIME25, (ii) average response length on the training set, and (iii) policy entropy on the training set.
  • Figure 4: Performance of Evol-RL's exploration-enhancing components when applied to a standard supervised GRPO baseline. The Qwen3-4B-Base model is trained on the MATH training set (Hendrycks et al., 2021) with a ground-truth verifier (RLVR).
  • Figure 5: Training dynamics of the majority-vote accuracy (maj@16) for Evol-RL and TTRL. Each panel plots the accuracy of the consensus answer derived from 16 rollouts over the course of training. The training datasets are: (Left) MATH-TRAIN, (Middle) MATH-500, and (Right) AIME24.
  • ...and 1 more figure
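
The figure captions report pass@1, pass@16, and maj@16. As a reference for how such metrics are commonly computed, here is a minimal sketch. It assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct; the source does not state which estimator the authors use, and the helper names `pass_at_k` and `maj_at_k` are hypothetical.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k draws (without
    replacement) from n samples is correct, given c correct samples.

    ASSUMPTION: standard unbiased estimator; not confirmed by the source.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, which yields pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def maj_at_k(answers, truth) -> bool:
    """True if the most frequent answer among the rollouts matches truth.

    Ties are broken in favor of the answer that appears first.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return majority == truth


if __name__ == "__main__":
    print(pass_at_k(n=16, c=2, k=1))   # fraction of correct samples
    print(pass_at_k(n=16, c=2, k=16))  # any correct sample counts
    print(maj_at_k(["4", "4", "5", "4"], "4"))
```

The two metrics can move in opposite directions: a policy that collapses onto one answer can hold or raise maj@16 while pass@16 falls, because fewer distinct candidate answers are sampled. This is the pattern Figure 1 attributes to majority-only training.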