Table of Contents
Fetching ...

InstructionCP: A fast approach to transfer Large Language Models into target language

Kuang-Ming Chen, Hung-yi Lee

TL;DR

This paper presents Instruction Continual Pre-training (InsCP), a unified CP/SFT approach that injects instruction tags into continual pre-training to transfer English-centric LLMs to non-English languages without sacrificing conversational abilities or RLHF. By leveraging a chat-template augmented training objective and a small, carefully selected dataset (0.1B tokens for Traditional Chinese, with low-perplexity instruction data and high-perplexity wiki data), InsCP achieves strong language alignment in the target language while preserving English knowledge and safety properties. Across language alignment, reliability, and knowledge benchmarks, InsCP demonstrates comparable or improved performance relative to English-centric baselines and traditional CP pipelines, with MT-Bench confirming robust multi-turn and language-aware generation. The approach significantly reduces data requirements and computational resources, enabling faster cross-language deployment for LLMs, though it remains data-sensitive for languages with limited high-quality instruction content.

Abstract

The rapid development of large language models (LLMs) in recent years has largely focused on English, resulting in models that respond exclusively in English. To adapt these models to other languages, continual pre-training (CP) is often employed, followed by supervised fine-tuning (SFT) to maintain conversational abilities. However, CP and SFT can reduce a model's ability to filter harmful content. We propose Instruction Continual Pre-training (InsCP), which integrates instruction tags into the CP process to prevent loss of conversational proficiency while acquiring new languages. Our experiments demonstrate that InsCP retains conversational and Reinforcement Learning from Human Feedback (RLHF) abilities. Empirical evaluations on language alignment, reliability, and knowledge benchmarks confirm the efficacy of InsCP. Notably, this approach requires only 0.1 billion tokens of high-quality instruction-following data, thereby reducing resource consumption.

InstructionCP: A fast approach to transfer Large Language Models into target language

TL;DR

This paper presents Instruction Continual Pre-training (InsCP), a unified CP/SFT approach that injects instruction tags into continual pre-training to transfer English-centric LLMs to non-English languages without sacrificing conversational abilities or RLHF. By leveraging a chat-template augmented training objective and a small, carefully selected dataset (0.1B tokens for Traditional Chinese, with low-perplexity instruction data and high-perplexity wiki data), InsCP achieves strong language alignment in the target language while preserving English knowledge and safety properties. Across language alignment, reliability, and knowledge benchmarks, InsCP demonstrates comparable or improved performance relative to English-centric baselines and traditional CP pipelines, with MT-Bench confirming robust multi-turn and language-aware generation. The approach significantly reduces data requirements and computational resources, enabling faster cross-language deployment for LLMs, though it remains data-sensitive for languages with limited high-quality instruction content.

Abstract

The rapid development of large language models (LLMs) in recent years has largely focused on English, resulting in models that respond exclusively in English. To adapt these models to other languages, continual pre-training (CP) is often employed, followed by supervised fine-tuning (SFT) to maintain conversational abilities. However, CP and SFT can reduce a model's ability to filter harmful content. We propose Instruction Continual Pre-training (InsCP), which integrates instruction tags into the CP process to prevent loss of conversational proficiency while acquiring new languages. Our experiments demonstrate that InsCP retains conversational and Reinforcement Learning from Human Feedback (RLHF) abilities. Empirical evaluations on language alignment, reliability, and knowledge benchmarks confirm the efficacy of InsCP. Notably, this approach requires only 0.1 billion tokens of high-quality instruction-following data, thereby reducing resource consumption.
Paper Structure (24 sections, 3 equations, 1 figure, 5 tables)

This paper contains 24 sections, 3 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: n illustration to demonstrate the difference between the traditional approach and our method. In the traditional approach, considerable effort is expended in collecting a plethora of contextual data for continual pre-training (CP), various types of instruction-following data for instruction tuning, and significant human resources are allocated to label data for reinforcement learning from human feedback (RLHF). However, with our method, Instruction Continual Pre-training (InsCP), these processes are streamlined into a single step