Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts

Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Xiyun Xu, Yang Song, Yiming Jia, Yuntao Wen, Yunzhi Xu, Zekai Wang, Zhenwei An, Zhicong Sun, Zongchao Chen

TL;DR

Nanbeige4.1-3B is a unified 3B-parameter generalist model for reasoning, coding, and long-horizon agentic tasks. It combines point-wise and pair-wise reward modeling with a staged, multi-domain training pipeline. Key innovations include a 256k-context SFT phase, a depth-enhanced data construction pipeline for deep search, a judge-driven coding data workflow with a two-stage code RL strategy, and turn-level plus trajectory-level credit assignment to sustain long-horizon planning. Empirical results show strong cross-domain performance: the model outperforms open-source 3B baselines and rivals larger models on many benchmarks, including live-code and multi-hop search tasks, with notable success on LeetCode challenges. The work shows that carefully aligned objectives and data pipelines can yield both broad competence and strong specialization in compact models, with practical open-source value for research on efficient agent-enabled language systems.

Abstract

We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards in Reinforcement Learning, optimizing both correctness and efficiency. In deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, even achieving superior performance compared to much larger models, such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B parameter models.

Paper Structure (54 sections, 3 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Performance of Nanbeige4-3B-Thinking vs. Qwen model series.
  • Figure 2: A data construction pipeline for deep search, including complex multi-hop QA sampling and the synthesis of long-horizon reasoning trajectories.
  • Figure 3: Gated time-complexity reward design in code RL. The time reward $R_{\mathrm{time}}$ is activated only when a solution passes all test cases ($\mathrm{PassRate}=1$), and the judge system provides online feedback by comparing the predicted time complexity against the reference optimal bound.
  • Figure 4: Training dynamics of two-stage code RL. We track the reward (including the gated $R_{\mathrm{time}}$ in Stage 2) and LiveCodeBench performance across training, showing consistent improvements from Stage 1 to Stage 2.
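
The gating described in the Figure 3 caption can be illustrated with a minimal sketch. The caption only states that $R_{\mathrm{time}}$ is activated when $\mathrm{PassRate}=1$ and that a judge compares the predicted time complexity against the reference optimal bound. Everything else here is an assumption made for illustration: the additive combination, the `time_weight` value, the binary time reward, and the fixed `COMPLEXITY_ORDER` table, which stands in for the judge system and is not the paper's implementation.

```python
# Hypothetical ordering of complexity classes, fastest first.
# A stand-in for the paper's judge system, which is not specified here.
COMPLEXITY_ORDER = ["O(1)", "O(log n)", "O(n)", "O(n log n)", "O(n^2)", "O(2^n)"]


def time_reward(predicted: str, reference: str) -> float:
    """Return 1.0 if the predicted complexity is no worse than the reference bound."""
    predicted_rank = COMPLEXITY_ORDER.index(predicted)
    reference_rank = COMPLEXITY_ORDER.index(reference)
    return 1.0 if predicted_rank <= reference_rank else 0.0


def gated_reward(
    passed: int,
    total: int,
    predicted: str,
    reference: str,
    time_weight: float = 0.5,  # assumed weight, not taken from the paper
) -> float:
    """Correctness reward plus a time reward that is gated on PassRate = 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    pass_rate = passed / total
    # Gate: the time reward contributes only when every test case passes.
    if passed < total:
        return pass_rate
    return pass_rate + time_weight * time_reward(predicted, reference)


if __name__ == "__main__":
    print(gated_reward(3, 4, "O(n)", "O(n)"))        # partial pass: time reward inactive
    print(gated_reward(4, 4, "O(n)", "O(n log n)"))  # full pass, meets the bound
    print(gated_reward(4, 4, "O(n^2)", "O(n)"))      # full pass, slower than the bound
```

Gating on an integer comparison (`passed < total`) rather than on the floating-point pass rate avoids rounding issues, and it keeps a partially correct solution from earning credit for efficiency.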