Tele-FLM Technical Report

Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Chao Wang, Xinzhang Liu, Zihan Wang, Yu Zhao, Xin Wang, Yuyao Huang, Shuangyong Song, Yongxiang Li, Zheng Zhang, Bo Zhao, Aixin Sun, Yequan Wang, Zhongjiang He, Zhongyuan Wang, Xuelong Li, Tiejun Huang

TL;DR

Tele-FLM addresses the challenge of scalable open LLMs by introducing a 52B multilingual decoder trained on roughly $2\text{T}$ tokens using a stable, low-cost pre-training paradigm. It combines an FLM-101B-inspired backbone with RMSNorm, SwiGLU, and RoPE, a tailored BBPE tokenizer, and a 3D parallelism strategy, and it uses $\mu$P-based hyperparameter search to minimize trial-and-error. Benchmark results show that Tele-FLM matches or surpasses larger, more FLOP-intensive baselines on both English and Chinese tasks, demonstrating strong multilingual compression and robust knowledge and reasoning capabilities. By releasing weights, data composition, and training dynamics, the work aims to accelerate open-source LLM progress and promote greener, more efficient AI development.

Abstract

Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies for efficiently scaling LLMs beyond 50 billion parameters with minimal trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, as measured by bits-per-byte (BPB) on textual corpora. Moreover, in both English and Chinese foundation model evaluations, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.
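
The BPB metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common definition of bits-per-byte: the summed negative log-likelihood of a text (in nats) converted to bits and divided by the text's UTF-8 byte count, which makes scores comparable across tokenizers. The function name `bits_per_byte` and its arguments are hypothetical and are not taken from the report.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed token-level NLL (in nats) into bits per UTF-8 byte.

    Assumption: total_nll_nats is the sum of per-token negative
    log-likelihoods that some language model assigned to `text`.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    # nats -> bits (divide by ln 2), then normalize by byte count
    return total_nll_nats / (math.log(2) * n_bytes)
```

Because the denominator counts bytes rather than tokens, a model with a coarser tokenizer is not rewarded simply for emitting fewer tokens, which is one reason the metric suits multilingual comparisons.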

Paper Structure (16 sections, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Experimental curves of hyperparameter search based on $\mu$P.
  • Figure 2: Pre-training curves for Tele-FLM w.r.t. amount of data in billion tokens.
  • Figure 3: BPB curves of Tele-FLM on representative English (en), Chinese (zh), multi-language, and code validation datasets, compared with Llama series.
  • Figure 4: Evolution of performance evaluated by Language Model Evaluation Harness during training. Note that we sampled 20% of the examples for HellaSwag and 30% of the examples for MMLU to limit the time cost.