Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B

Sen Xu, Yi Zhou, Wei Wang, Jixin Min, Zhibin Yin, Yingwei Dai, Shixi Liu, Lianyu Pang, Yirong Chen, Junlin Zhang

TL;DR

This work introduces VibeThinker-1.5B, a compact 1.5B-parameter model trained under the Spectrum-to-Signal Principle (SSP) to achieve strong reasoning at minimal cost. Training is split into a diversity-focused Spectrum Phase (Two-Stage Diversity-Exploring Distillation) and a signal-focused RL phase (MaxEnt-Guided Policy Optimization, MGPO). The first phase yields a rich spectrum of candidate solutions, and the RL phase then amplifies the correct signal, enabling the model to outperform far larger counterparts on math benchmarks such as AIME24, AIME25, and HMMT25, as well as on coding tasks in LiveCodeBench. The model achieves these results for under $8K in post-training cost and roughly 3,900 GPU-hours, suggesting that small models can approach large-model reasoning with substantial cost and energy savings. These findings prompt a reevaluation of scaling laws for reasoning and highlight the potential for broader participation in AI research through efficient, diversity-driven training paradigms.

Abstract

Challenging the prevailing consensus that small models inherently lack robust reasoning, this report introduces VibeThinker-1.5B, a 1.5B-parameter dense model developed via our Spectrum-to-Signal Principle (SSP). The approach stands in contrast to the dominant strategy of scaling model parameters to enhance capabilities, as seen in models like DeepSeek R1 (671B) and Kimi k2 (>1T). The SSP framework first employs Two-Stage Diversity-Exploring Distillation (SFT) to generate a broad spectrum of solutions, followed by MaxEnt-Guided Policy Optimization (RL) to amplify the correct signal. With a total training cost of only $7,800, VibeThinker-1.5B demonstrates superior reasoning capabilities compared to closed-source models like Magistral Medium and Claude Opus 4, and performs on par with open-source models like GPT OSS-20B Medium. Remarkably, it surpasses DeepSeek R1, a model more than 400x larger, on three math benchmarks: AIME24 (80.3 vs. 79.8), AIME25 (74.4 vs. 70.0), and HMMT25 (50.4 vs. 41.7). These scores are a substantial improvement over those of its base model (6.7, 4.3, and 0.6, respectively). On LiveCodeBench V6, it scores 51.1, outperforming Magistral Medium's 50.3 and far exceeding its base model's 0.0. These findings demonstrate that small models can achieve reasoning capabilities comparable to large models, drastically reducing training and inference costs and thereby democratizing advanced AI research.
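
As a quick sanity check on the figures quoted above, the short Python sketch below recomputes the margins over DeepSeek R1 and over the base model, the parameter ratio between the two models, and the implied cost per GPU-hour. The function and variable names are illustrative and do not come from the paper. The GPU-hour rate additionally assumes that the $7,800 cost and the approximate 3,900 GPU-hours quoted in the TL;DR describe the same training run.

```python
# Scores quoted in the abstract, as (VibeThinker-1.5B, DeepSeek R1, base model).
MATH_SCORES = {
    "AIME24": (80.3, 79.8, 6.7),
    "AIME25": (74.4, 70.0, 4.3),
    "HMMT25": (50.4, 41.7, 0.6),
}


def margins(scores):
    """Return {benchmark: (gain over DeepSeek R1, gain over base model)}."""
    return {
        name: (round(vibe - r1, 1), round(vibe - base, 1))
        for name, (vibe, r1, base) in scores.items()
    }


# Parameter counts in billions, as stated in the abstract.
PARAM_RATIO = 671 / 1.5

# Assumption: the $7,800 cost and ~3,900 GPU-hours refer to the same run.
COST_PER_GPU_HOUR = 7800 / 3900

if __name__ == "__main__":
    for name, (vs_r1, vs_base) in margins(MATH_SCORES).items():
        print(f"{name}: +{vs_r1} vs. DeepSeek R1, +{vs_base} vs. base model")
    print(f"Parameter ratio: about {PARAM_RATIO:.0f}x")
    print(f"Implied cost: about ${COST_PER_GPU_HOUR:.2f} per GPU-hour")
```

Under these assumptions, the margins over DeepSeek R1 come out to 0.5, 4.4, and 8.7 points, the parameter ratio is about 447x, and the implied rate is about $2 per GPU-hour.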

Paper Structure

This paper contains 13 sections, 11 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Performance of VibeThinker-1.5B versus competing models on representative benchmarks.
  • Figure 2: VibeThinker-1.5B demonstrates remarkable efficiency, surpassing much larger and stronger models on the challenging AIME25 benchmark. It achieves a score of 74.4, outperforming strong baselines such as GPT-OSS-20B-Medium (72.1/20B), DeepSeek-R1-0120 (70.0/671B), and Seed-Thinking v1.5 (74.0/200B).
  • Figure 3: The training pipeline of VibeThinker-1.5B.