Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Yujun Zhou; Zhenwen Liang; Haolin Liu; Wenhao Yu; Kishan Panaganti; Linfeng Song; Dian Yu; Xiangliang Zhang; Haitao Mi; Dong Yu

TL;DR

EVOL-RL addresses entropy collapse in label-free LLM evolution by coupling a majority-voted selection anchor with a novelty-driven variation signal, optimized via GRPO. By maintaining stability while encouraging diverse reasoning, it prevents loss of exploration and enhances both in-domain and out-of-domain generalization, achieving substantial gains on math benchmarks and beyond. Ablation studies confirm the necessity of novelty, entropy regularization, and asymmetric clipping, and results extend to supervised GRPO settings, illustrating broad applicability and practical impact for autonomous model improvement without labels.

Abstract

Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing self-improvement approaches primarily rely on self-confirmation signals (e.g., confidence, entropy, or consistency) to generate rewards. This reliance drives models toward over-confident, majority-favored solutions, causing an entropy collapse that degrades pass@n and reasoning complexity. To address this, we propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation. Concretely, EVOL-RL retains the majority-voted answer as an anchor for stability, but adds a novelty-aware reward that scores each sampled solution by how different its reasoning is from other concurrently generated responses. This majority-for-stability + novelty-for-exploration rule mirrors the variation-selection principle: selection prevents drift, while novelty prevents collapse. Evaluation results show that EVOL-RL consistently outperforms the majority-only baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from the baseline's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents in-domain diversity collapse but also improves out-of-domain generalization (from math reasoning to broader tasks, e.g., GPQA, MMLU-Pro, and BBEH). The code is available at: https://github.com/YujunZhou/EVOL-RL.
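The majority-plus-novelty rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the token-level Jaccard similarity (a stand-in for whatever reasoning-similarity measure the paper uses), and the reward bands set by `base` and `span` are all assumptions chosen for illustration. The only properties taken from the abstract are that the majority-voted answer acts as the anchor and that each response is scored by how different its reasoning is from the other responses sampled for the same prompt.

```python
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (assumed stand-in for a semantic measure)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def evol_style_rewards(answers, reasonings, base=0.5, span=0.5):
    """Return one reward per sampled response for a single prompt.

    Hypothetical banding: responses whose final answer matches the majority
    vote land in [base, base + span]; all others land in
    [-(base + span), -base]. Within each band, higher novelty (lower mean
    similarity to the other responses) gives a higher reward.
    """
    n = len(answers)
    majority, _ = Counter(answers).most_common(1)[0]  # selection anchor
    rewards = []
    for i in range(n):
        sims = [jaccard(reasonings[i], reasonings[j]) for j in range(n) if j != i]
        mean_sim = sum(sims) / len(sims) if sims else 1.0
        novelty = 1.0 - mean_sim  # variation signal in [0, 1]
        if answers[i] == majority:
            rewards.append(base + span * novelty)
        else:
            rewards.append(-(base + span) + span * novelty)
    return rewards


# Two responses agree on "42" with identical reasoning; one dissents.
print(evol_style_rewards(["42", "42", "7"], ["a b c", "a b c", "x y z"]))
# [0.75, 0.75, -0.5]
```

In this sketch the majority label alone decides the sign of the reward, so novelty can never outweigh selection; it only reorders responses within the majority and minority groups. That separation is one simple way to keep "selection prevents drift" and "novelty prevents collapse" from interfering with each other.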
