LongCat-Flash Technical Report

Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, Chengcheng Han, Chenguang Xi, Chi Zhang, Chong Peng, Chuan Qin, Chuyu Zhang, Cong Chen, Congkui Wang, Dan Ma, Daoru Pan, Defei Bu, Dengchang Zhao, Deyang Kong, Dishan Liu, Feiye Huo, Fengcun Li, Fubao Zhang, Gan Dong, Gang Liu, Gang Xu, Ge Li, Guoqiang Tan, Guoyuan Lin, Haihang Jing, Haomin Fu, Haonan Yan, Haoxing Wen, Haozhe Zhao, Hong Liu, Hongmei Shi, Hongyan Hao, Hongyin Tang, Huantian Lv, Hui Su, Jiacheng Li, Jiahao Liu, Jiahuan Li, Jiajun Yang, Jiaming Wang, Jian Yang, Jianchao Tan, Jiaqi Sun, Jiaqi Zhang, Jiawei Fu, Jiawei Yang, Jiaxi Hu, Jiayu Qin, Jingang Wang, Jiyuan He, Jun Kuang, Junhui Mei, Kai Liang, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Liang Gao, Liang Shi, Lianhui Ma, Lin Qiu, Lingbin Kong, Lingtong Si, Linkun Lyu, Linsen Guo, Liqi Yang, Lizhi Yan, Mai Xia, Man Gao, Manyuan Zhang, Meng Zhou, Mengxia Shen, Mingxiang Tuo, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pingwei Sun, Qi Gu, Qianyun Li, Qingyuan Li, Qiong Huang, Qiyuan Duan, Ran Meng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shizhe Wu, Shuai Liang, Shuo Wang, Suogui Dang, Tao Fang, Tao Li, Tefeng Chen, Tianhao Bai, Tianhao Zhou, Tingwen Xie, Wei He, Wei Huang, Wei Liu, Wei Shi, Wei Wang, Wei Wu, Weikang Zhao, Wen Zan, Wenjie Shi, Xi Nan, Xi Su, Xiang Li, Xiang Mei, Xiangyang Ji, Xiangyu Xi, Xiangzhou Huang, Xianpeng Li, Xiao Fu, Xiao Liu, Xiao Wei, Xiaodong Cai, Xiaolong Chen, Xiaoqing Liu, Xiaotong Li, Xiaowei Shi, Xiaoyu Li, Xili Wang, Xin Chen, Xing Hu, Xingyu Miao, Xinyan He, Xuemiao Zhang, Xueyuan Hao, Xuezhi Cao, Xunliang Cai, Xurui Yang, Yan Feng, Yang Bai, Yang Chen, Yang Yang, Yaqi Huo, Yerui Sun, Yifan Lu, Yifan Zhang, Yipeng Zang, Yitao Zhai, Yiyang Li, Yongjing Yin, Yongkang Lv, Yongwei Zhou, Yu Yang, Yuchen Xie, Yueqing Sun, Yuewen Zheng, Yuhuai Wei, Yulei Qian, Yunfan Liang, Yunfang Tai, Yunke Zhao, Zeyang Yu, Zhao Zhang, 
Zhaohua Yang, Zhenchao Zhang, Zhikang Xia, Zhiye Zou, Zhizhao Zeng, Zhongda Su, Zhuofan Chen, Zijian Zhang, Ziwen Wang, Zixu Jiang, Zizhe Zhao, Zongyu Wang, Zunhai Su

TL;DR

LongCat-Flash tackles the dual challenge of scaling large language models efficiently and enabling agentic capabilities by introducing Zero-computation Experts and Shortcut-connected MoE (ScMoE), paired with a multi-pronged stability suite and a staged regime of pre-training, mid-training, and post-training. The model achieves high throughput and low latency (over 100 tokens per second at inference, at a cost of about \$0.70 per million output tokens) and was trained on more than 20 trillion tokens within 30 days, enabled by deterministic computation, routing and bias control, and overlapped computation and communication. A comprehensive training and inference stack, covering hyperparameter transfer, model-growth initialization, variance alignment, and data decontamination, supports robust scaling to 560B total parameters with about 27B activated per token on average. Post-training data synthesis, agentic task design, and VitaBench-style evaluation demonstrate strong agentic tool use and safe instruction following, and the model checkpoint is open-sourced to accelerate community research on efficient MoE architectures and agentic AI.

Abstract

We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enable dynamic computational budget allocation and activate 18.6B-31.3B parameters (27B on average) per token depending on contextual demands, optimizing resource usage; and (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy between scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of \$0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool-use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research.

LongCat Chat: https://longcat.ai
Hugging Face: https://huggingface.co/meituan-longcat
GitHub: https://github.com/meituan-longcat
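The headline figures in the abstract can be put on a common footing with some back-of-envelope arithmetic. The sketch below uses only numbers quoted in the abstract; the function names are illustrative, and the cost per stream-hour assumes a single decoding stream running at exactly 100 TPS, which is a simplification rather than anything the report states.

```python
# Back-of-envelope arithmetic using only figures quoted in the abstract.
# Function names are illustrative; nothing here comes from the LongCat codebase.

TOTAL_PARAMS_B = 560.0            # total parameters, in billions
ACTIVE_MIN_B = 18.6               # minimum activated parameters per token
ACTIVE_AVG_B = 27.0               # average activated parameters per token
ACTIVE_MAX_B = 31.3               # maximum activated parameters per token
TRAIN_TOKENS = 20e12              # "more than 20 trillion": a lower bound
TRAIN_DAYS = 30                   # "within 30 days": an upper bound
DECODE_TPS = 100.0                # "over 100 TPS": a lower bound
COST_PER_M_OUTPUT_USD = 0.70      # quoted cost per million output tokens


def active_fraction(active_b, total_b=TOTAL_PARAMS_B):
    """Fraction of all parameters activated for one token."""
    return active_b / total_b


def training_tokens_per_second(tokens=TRAIN_TOKENS, days=TRAIN_DAYS):
    """Average cluster-wide training throughput implied by the abstract."""
    return tokens / (days * 24 * 3600)


def seconds_per_million_tokens(tps=DECODE_TPS):
    """Wall-clock time for one stream to emit a million tokens."""
    return 1e6 / tps


def cost_per_stream_hour(tps=DECODE_TPS, usd_per_m=COST_PER_M_OUTPUT_USD):
    """Assumption: one stream decoding continuously at `tps` for an hour."""
    return tps * 3600 / 1e6 * usd_per_m


print(f"activated fraction (avg): {active_fraction(ACTIVE_AVG_B):.2%}")
print(f"activated fraction range: {active_fraction(ACTIVE_MIN_B):.2%}"
      f" to {active_fraction(ACTIVE_MAX_B):.2%}")
print(f"training throughput: {training_tokens_per_second():,.0f} tokens/s")
print(f"time per 1M output tokens: {seconds_per_million_tokens():,.0f} s")
print(f"cost per stream-hour: ${cost_per_stream_hour():.3f}")
```

Because the abstract gives bounds ("more than", "within", "over"), the training throughput and decoding speed computed here are lower bounds, not measured values.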

Paper Structure

This paper contains 62 sections, 13 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Benchmark performance of LongCat-Flash.
  • Figure 2: The architecture adopted in LongCat-Flash. Each layer employs Shortcut-connected Mixture of Experts (ScMoE) with zero-computation experts. ScMoE significantly expands the computation-communication window to boost training and inference efficiency. The zero-computation experts enable dynamic computation based on contextual importance, improving the efficiency of computational resource utilization.
  • Figure 3: (a) Validation loss curves comparing models with and without zero-computation experts under matched computation budgets. The baseline (top-k=8, blue) activates a fixed 6B parameters per token, while the zero-expert variant (top-k=12, orange) dynamically activates 4.2B-7.0B parameters but maintains an expectation of 8 activated FFN experts (with fluctuation of less than 1%). The consistent loss reduction demonstrates the efficacy of zero-computation experts. (b) The average number of activated FFN experts during LongCat-Flash training. The average stays close to 8, corresponding to an expected 27B activated parameters. (c) The standard deviation of the number of activated FFN experts grows to 3, indicating substantial variability in activated parameters across different tokens.
  • Figure 4: Training loss curves comparing baseline models (without ScMoE) against their ScMoE-enhanced counterparts across four different model configurations. In all experiments—(a) 2.4B-16B with MLA, (b) 3B-20B with MHA, and (c) 15B-193B with GQA—the loss curves are virtually indistinguishable. This provides robust evidence that the ScMoE optimization is quality-neutral, and its benefits are orthogonal to both model scale and the specific attention architecture used.
  • Figure 5: (a) Incorporating the scale-correction factor in MLA improves convergence (lower loss) on an MoE model with 1B activated parameters. (b) Validation loss curve of an MoE model with 6B activated parameters in the model-growth experiments.
  • ...and 6 more figures
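
The captions of Figures 2 and 3 describe top-k routing over a pool that mixes FFN experts with zero-computation experts, so the number of FFN experts a token actually uses varies. The following is a minimal toy simulation of that idea, not the report's implementation. The pool sizes (32 FFN experts, 16 zero-computation experts) are hypothetical, and the i.i.d. random scores are a stand-in for a learned router. Only top-k=12 and the target expectation of 8 FFN experts come from the Figure 3 caption.

```python
import random
import statistics


def route_token(scores, top_k):
    """Return the indices of the top_k highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


def count_ffn(selected, num_ffn):
    """Indices below num_ffn are FFN experts; the rest are zero-computation."""
    return sum(1 for i in selected if i < num_ffn)


def simulate(num_tokens=2000, num_ffn=32, num_zero=16, top_k=12, seed=0):
    """Count activated FFN experts per token under random router scores.

    Assumption: scores are i.i.d. Gaussian, standing in for a trained router.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(num_tokens):
        scores = [rng.gauss(0.0, 1.0) for _ in range(num_ffn + num_zero)]
        counts.append(count_ffn(route_token(scores, top_k), num_ffn))
    return counts


counts = simulate()
print("mean activated FFN experts:", statistics.mean(counts))
print("std of activated FFN experts:", statistics.pstdev(counts))
```

With i.i.d. scores, the expected FFN count is top_k * num_ffn / (num_ffn + num_zero) = 12 * 32 / 48 = 8, which is why these toy pool sizes were chosen. A trained router would instead shift selections toward FFN experts for tokens it scores as more demanding, which this sketch does not model.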