MiniCPM4: Ultra-Efficient LLMs on End Devices

MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengda Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Baoxi Ji, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Xin Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Lushi Pu, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Zheng Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Xiaoyue Xu, Yukun Yan, Jiarui Yuan, Jinqian Zhang, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Chuyue Zhou, Ge Zhou, Jie Zhou, Wei Zhou, Yanghao Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun

TL;DR

MiniCPM4 targets ultra-efficient LLMs for end devices by innovating across architecture, data, training, and inference. It introduces InfLLM v2, a trainable sparse attention mechanism that sustains long-context processing at high sparsity, and pairs it with the UltraClean pre-training data pipeline and ModelTunnel v2 for efficient pre-training strategy search. In post-training, the UltraChat v2 fine-tuning dataset, chunk-wise rollout for reinforcement learning, and BitCPM4 quantization-aware training (QAT) enable strong reasoning and on-device deployment, while CPM.cu and ArkInfer provide an efficient, cross-platform inference stack. Evaluations show competitive performance with far fewer pre-training tokens (e.g., 8T vs 36T for comparators, roughly 22% of the budget) and decoding speedups of up to about 7x on edge hardware, plus effective long-context handling and on-device tool usage. The work demonstrates practical, scalable on-device LLMs with strong reasoning capabilities and broad deployment potential.

Abstract

This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both the prefilling and decoding phases of long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and a data-efficient ternary LLM, BitCPM. Regarding inference systems, we propose CPM.cu, which integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Furthermore, we construct a hybrid reasoning model, MiniCPM4.1, which can be used in both deep reasoning mode and non-reasoning mode. Evaluation results demonstrate that MiniCPM4 and MiniCPM4.1 outperform similar-sized open-source models across benchmarks, with the 8B variants showing significant speed improvements on long sequence understanding and generation.
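
To make the term "ternary LLM" concrete, the sketch below maps a list of weights to codes in {-1, 0, +1} plus one shared scale. It is a generic illustration that assumes an absolute-mean scaling rule; the abstract does not describe the BitCPM recipe, so the rule and the names `ternarize` and `dequantize` are hypothetical and not taken from the paper.

```python
def ternarize(weights, eps=1e-8):
    """Map weights to ternary codes in {-1, 0, +1} plus one shared scale.

    Assumption: the scale is the mean absolute weight. This is a common
    choice for ternary quantization, not necessarily what BitCPM uses.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clamp into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [c * scale for c in codes]


codes, scale = ternarize([0.9, -0.05, 0.4, -1.2])
approx = dequantize(codes, scale)
```

For the four weights above, the codes are `[1, 0, 1, -1]`: small weights collapse to zero and large ones saturate at plus or minus one scale unit. Each stored weight then needs fewer than two bits, which is the motivation for ternary models on memory-constrained devices.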

Paper Structure

This paper contains 55 sections, 11 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Inference Speed Evaluation on end-side GPUs.
  • Figure 2: The illustration of InfLLM v2. Each query group selects parts of key-value blocks for attention computation, where the initial tokens and local tokens in the sliding window are always selected.
  • Figure 3: The illustration of high-quality data filtering pipelines. Traditional model-based data filtering methods (a) and (b) rely on human expertise for seed data selection and lack data quality verification.
  • Figure 4: The sigmoid relationship between loss and downstream performance on ScalingBench.
  • Figure 5: The relationship between language modeling loss and ratio of QAT post-training tokens (proportion of full stable-phase tokens).
  • ...and 3 more figures
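
The block selection described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the InfLLM v2 implementation: it works at block granularity, takes per-block relevance scores as given (the caption does not say how they are computed), and uses hypothetical names and parameters (`select_blocks`, `top_k`, `n_init`, `n_local`).

```python
def select_blocks(scores, query_block, top_k, n_init=1, n_local=1):
    """Return sorted indices of key-value blocks one query group attends to.

    scores      -- hypothetical relevance score per key-value block
    query_block -- index of the block holding the query (causal limit)
    top_k       -- number of extra blocks chosen by score
    n_init      -- leading blocks that are always selected (initial tokens)
    n_local     -- trailing blocks that are always selected (sliding window)
    """
    # Causal masking: only blocks up to and including the query's block.
    visible = list(range(query_block + 1))
    # Initial and local blocks are selected regardless of their scores.
    forced = set(visible[:n_init])
    forced |= set(visible[max(0, len(visible) - n_local):])
    # Rank the remaining blocks by score and keep the best top_k.
    candidates = [b for b in visible if b not in forced]
    candidates.sort(key=lambda b: scores[b], reverse=True)
    return sorted(forced | set(candidates[:top_k]))


scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.05, 0.4]
chosen = select_blocks(scores, query_block=7, top_k=2, n_init=1, n_local=2)
```

Here `chosen` is `[0, 1, 3, 6, 7]`: block 0 (initial) and blocks 6 and 7 (local window) are forced, and blocks 1 and 3 win on score. Attention is then computed over 5 of 8 blocks rather than all of them, which is where the savings in long-context prefilling and decoding would come from.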