TeleChat Technical Report

Zhongjiang He, Zihan Wang, Xinzhang Liu, Shixuan Liu, Yitong Yao, Yuyao Huang, Xuelong Li, Yongxiang Li, Zhonghao Che, Zhaoxi Zhang, Yan Wang, Xin Wang, Luwen Pu, Huinan Xu, Ruiyu Fang, Yu Zhao, Jie Zhang, Xiaomeng Huang, Zhilong Lu, Jiaxin Peng, Wenjun Zheng, Shiquan Wang, Bingkai Yang, Xuewei He, Zhuoru Jiang, Qiyi Xie, Yanhan Zhang, Zhongqiu Li, Lingling Shi, Weiwei Fu, Yin Zhang, Zilu Huang, Sishi Xiong, Yuxiang Zhang, Chao Wang, Shuangyong Song

TL;DR

TeleChat is an open family of bilingual LLMs (3B, 7B, 12B) pretrained on an English-Chinese corpus of trillions of tokens and then aligned via supervised fine-tuning and reinforcement learning. To support a 96k-token context, it combines NTK-aware interpolation, multi-stage long-context training, and LogN-scaling; it also applies Noisy Embedding Fine Tuning to improve data-efficient generalization. The paper details data curation, safety, and distributed training (Megatron-DeepSpeed), reports competitive performance across language understanding, reasoning, and coding benchmarks, and shows that knowledge-graph augmentation mitigates hallucination. By releasing fine-tuned 7B and 12B checkpoints, code, and a portion of the pretraining data, TeleChat aims to improve reproducibility and accelerate research on and applications of open LLMs in bilingual settings and real-world tasks.
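
The three named techniques have commonly published forms that can be sketched in a few lines. The snippet below is a minimal illustration of those general forms, not TeleChat's implementation: the function names and the example values (a rotary base of 10000, a head dimension of 128, a scale factor of 8, a training length of 8192, and a noise level of 5.0) are assumptions chosen for illustration and do not come from the report.

```python
import math
import random


def ntk_scaled_base(base: float, head_dim: int, scale: float) -> float:
    """NTK-aware interpolation: enlarge the rotary base rather than shrink positions.

    With this exponent the lowest rotary frequency is divided by `scale`
    while the highest frequency is left unchanged.
    """
    return base * scale ** (head_dim / (head_dim - 2))


def rope_inv_freq(base: float, head_dim: int) -> list[float]:
    """Rotary inverse frequencies base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def logn_scale(seq_len: int, train_len: int) -> float:
    """LogN scaling: query multiplier log_{train_len}(seq_len), never below 1."""
    return max(1.0, math.log(seq_len) / math.log(train_len))


def neftune_noise(embeddings: list[list[float]], alpha: float,
                  rng: random.Random) -> list[list[float]]:
    """Noisy embedding fine-tuning: add uniform noise scaled by alpha / sqrt(L * d)."""
    seq_len, dim = len(embeddings), len(embeddings[0])
    magnitude = alpha / math.sqrt(seq_len * dim)
    return [[x + rng.uniform(-1.0, 1.0) * magnitude for x in row]
            for row in embeddings]


# Illustrative values only (assumptions, not taken from the report).
base, head_dim, scale = 10000.0, 128, 8.0
original = rope_inv_freq(base, head_dim)
extended = rope_inv_freq(ntk_scaled_base(base, head_dim, scale), head_dim)
query_factor = logn_scale(seq_len=32768, train_len=8192)
noisy = neftune_noise([[0.0] * 4 for _ in range(3)], alpha=5.0,
                      rng=random.Random(0))
```

In this sketch the noise is applied only to training-time input embeddings, and the LogN factor only takes effect once the sequence is longer than the assumed training length.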

Abstract

In this technical report, we present TeleChat, a collection of large language models (LLMs) with 3 billion, 7 billion, and 12 billion parameters. It includes pretrained language models as well as fine-tuned chat models that are aligned with human preferences. TeleChat is initially pretrained on an extensive corpus of trillions of tokens, comprising a diverse collection of English and Chinese texts. Subsequently, the model undergoes fine-tuning to align with human preferences, following a detailed methodology that we describe. We evaluate the performance of TeleChat on various tasks, including language understanding, mathematics, reasoning, code generation, and knowledge-based question answering. Our findings indicate that TeleChat achieves performance comparable to that of other open-source models of similar size across a wide range of public benchmarks. To support future research and applications utilizing LLMs, we release the fine-tuned model checkpoints of TeleChat's 7B and 12B variants, along with code and a portion of our pretraining data, to the public community.

Paper Structure

This paper contains 35 sections, 1 equation, 2 figures, and 8 tables.

Figures (2)

  • Figure 1: The overall process of introducing knowledge into prompts.
  • Figure 2: Illustration of the Top 30 categories in our SFT data.