52B to 1T: Lessons Learned via Tele-FLM Series

Xiang Li; Yiqun Yao; Xin Jiang; Xuezhi Fang; Chao Wang; Xinzhang Liu; Zihan Wang; Yu Zhao; Xin Wang; Yuyao Huang; Shuangyong Song; Yongxiang Li; Zheng Zhang; Bo Zhao; Aixin Sun; Yequan Wang; Zhongjiang He; Zhongyuan Wang; Xuelong Li; Tiejun Huang

TL;DR

This work scales Tele-FLM from $52B$ to $1T$ parameters using two levers: data-efficient supervised fine-tuning (SFT) for alignment, and a function-preserving progressive growth protocol applied in stages, $52B \to 102B \to 1T$. A modest, high-quality SFT dataset ($\sim$30k samples) is shown to yield robust instruction-following in Chinese, while targeted data improves maths and reasoning performance, as measured on AlignBench and TeleEval. The growth strategy, built on the MSG framework, expands width and depth using masks and distance-based layer selection, which enables smooth post-growth convergence and preserves knowledge across stages. Tele-FLM-Chat approaches GPT-4 performance on Chinese tasks, and the released Tele-FLM-1T weights are intended to support further research on training extremely large models under practical compute constraints.

Abstract

Large Language Models (LLMs) represent a significant stride toward Artificial General Intelligence. As scaling laws underscore the potential of increasing model sizes, the academic community has intensified its investigations into LLMs with capacities exceeding 50 billion parameters. This technical report builds on our prior work with Tele-FLM (also known as FLM-2), a publicly available 52-billion-parameter model. We delve into two primary areas: we first discuss our observation of Supervised Fine-tuning (SFT) on Tele-FLM-52B, which supports the "less is more" approach for SFT data construction; second, we demonstrate our experiments and analyses on the best practices for progressively growing a model from 52 billion to 102 billion, and subsequently to 1 trillion parameters. We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research.

Paper Structure (11 sections, 4 tables)

Introduction
Tele-FLM-Chat
Supervised Fine-tuning
Evaluation
AlignBench
TeleEval
Tele-FLM-1T
Model Architecture
Growth Strategies
Pre-training Details
Lessons Learned
