Orion-14B: Open-source Multilingual Large Language Models
Du Chen, Yi Huang, Xiaopu Li, Yongqiang Li, Yongqiang Liu, Haihui Pan, Leichao Xu, Dacheng Zhang, Zhipeng Zhang, Kun Han
TL;DR
Orion-14B is a multilingual 14B-parameter LLM trained on 2.5T tokens using a data-scheduling curriculum that stages language and content complexity. The release includes a base model and chat-oriented fine-tuned variants, evaluated across standard, multilingual, and retrieval-augmented tasks, with state-of-the-art results reported on several benchmarks. The work details data collection, quality filtering, deduplication, tokenizer choices, architectural tweaks, and the pretraining regime, plus supervised fine-tuning and evaluation methodologies. It also introduces extension models (long-context, quantized, RAG, plugin) and openly releases code and models to foster reproducibility and broader adoption.
Abstract
In this study, we introduce Orion-14B, a collection of multilingual large language models with 14 billion parameters. We utilize a data scheduling approach to train a foundational model on a diverse corpus of 2.5 trillion tokens, sourced from texts in English, Chinese, Japanese, Korean, and other languages. Additionally, we fine-tune a series of models tailored for conversational applications and other specific use cases. Our evaluation results demonstrate that Orion-14B achieves state-of-the-art performance across a broad spectrum of tasks. We make the Orion-14B model family and its associated code publicly accessible at https://github.com/OrionStarAI/Orion, aiming to inspire future research and practical applications in the field.
