Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Jingwen Chen, Zhichao Huang, Tao Li, Yifu Li, Huiying Lin, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, Yuxuan Wang, Yonghui Wu

TL;DR

Seed-X tackles open-source multilingual translation with a 7B LLM family (instruct and RL variants) trained on 28 languages. It combines a three-stage pre-training pipeline over large-scale monolingual and bilingual data with translation-focused supervised fine-tuning and PPO-based reinforcement learning guided by MT-specific rewards. Across FLORES-200, WMT-25, and Seed-X-Challenge, Seed-X rivals Tier-1 ultra-large models on automatic metrics and scores well in human evaluation, all while remaining open-source and scalable. The work provides actionable data-quality, prompting, and training strategies to push small- to mid-size LLMs toward competitive multilingual translation capabilities, potentially accelerating research and deployment in low-resource languages.

Abstract

Multilingual translation remains a challenging task for large language models (LLMs), which must handle intricate language patterns and avoid the stilted phrasing that arises in automated translation. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability at the 7B parameter scale. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then fine-tuned to translate with Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share the best practices from our optimization process and make the parameters publicly available to advance translation research and applications.

Paper Structure

This paper contains 29 sections, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Benchmark performance on Flores-200 of Seed-X and its counterparts.
  • Figure 2: Model size versus average multilingual translation performance for different-scale open-source models. Scores averaged over directions and test sets (see Table \ref{tab:main_results}).
  • Figure 3: Human evaluation (0-4) of models on Seed-X-Challenge for Chinese/English to 7 languages translation (detailed scores are presented in Appendix \ref{appendix:human_evaluation}).
  • Figure 4: Performance gains of revising and filtering parallel data on encoder-decoder models and LLMs.
  • Figure 5: Performance changes on benchmarks in different stages of pretraining (Stage I: mixed corpora with an increased ratio of parallel data, Stage II: parallel data only). English-distant languages include Japanese, Thai, Russian, Malay, Indonesian, and Arabic, while English-similar languages refer to German, Spanish, Portuguese, and French.
  • ...and 2 more figures