Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts
Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Xiyun Xu, Yang Song, Yiming Jia, Yuntao Wen, Yunzhi Xu, Zekai Wang, Zhenwei An, Zhicong Sun, Zongchao Chen
TL;DR
Nanbeige4.1-3B is a unified 3B generalist capable of reasoning, coding, and long-horizon agentic tasks. It integrates point-wise and pair-wise reward modeling with a staged, multi-domain training pipeline. Key innovations include a 256k-context SFT phase, a depth-enhanced data construction pipeline for deep search, a judge-driven coding data workflow with a two-stage code RL strategy, and turn-level and trajectory-level credit assignment to sustain long-horizon planning. Empirical results show strong cross-domain performance: the model outperforms open-source 3B baselines and rivals larger models on many benchmarks, including live-code and multi-hop search tasks, with notable success on LeetCode challenges. The work demonstrates that carefully aligned objectives and data pipelines can yield both broad competence and strong specialization in compact models, with practical open-source value for research on efficient agent-enabled language systems.
Abstract
We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling to encourage high-quality, human-aligned responses. For code generation, we design complexity-aware rewards for reinforcement learning that optimize both correctness and efficiency. For deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, and even surpasses much larger models, such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B-parameter models.
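The abstract does not give the form of the complexity-aware code reward. As a rough illustration of how a reward might optimize both correctness and efficiency, the following is a minimal sketch under stated assumptions: correctness acts as a hard gate, and a fully correct solution earns a bonus that shrinks as its runtime exceeds a reference runtime. The function name, the gating rule, the runtime ratio, and `efficiency_weight` are all hypothetical and are not taken from the paper.

```python
def complexity_aware_reward(
    passed: int,
    total: int,
    runtime_s: float,
    reference_runtime_s: float,
    efficiency_weight: float = 0.5,
) -> float:
    """Hypothetical reward: correctness gate plus an efficiency bonus.

    Assumption (not from the paper): a solution that fails any test gets 0;
    a fully correct solution gets 1 plus a bonus in [0, efficiency_weight].
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if passed < total:
        # Incorrect solutions earn no reward, so efficiency never
        # compensates for a wrong answer.
        return 0.0
    # Ratio is 1.0 when the solution matches the reference runtime and
    # falls toward 0 as the solution gets slower; it is capped at 1.0 so
    # faster-than-reference solutions do not earn unbounded reward.
    ratio = min(1.0, reference_runtime_s / max(runtime_s, 1e-9))
    return 1.0 + efficiency_weight * ratio


if __name__ == "__main__":
    print(complexity_aware_reward(5, 5, runtime_s=2.0, reference_runtime_s=1.0))
```

Gating on correctness first is one common design choice for this kind of reward; a staged setup could also train on correctness alone before switching the bonus on, but whether that matches the paper's two-stage code RL strategy is not stated in this section.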
