Trillion 7B Technical Report

Sungjun Han, Juyoung Suk, Suyeong An, Hyungguk Kim, Kyuseok Kim, Wonsuk Yang, Seungtaek Choi, Jamin Shin

TL;DR

This work tackles data imbalance in multilingual LLMs by introducing XLDA, a Cross-lingual Document Attention mechanism that enables efficient cross-language transfer from English to Korean and beyond. Through strategic batch packing, selective attention masking, two-stage pretraining, careful data filtering, tailored tokenization, and long-context extension, Trillion-7B achieves strong multilingual performance with only about 10% multilingual data and modest compute (roughly 59.4K H100 hours). The paper provides extensive evaluations across 27 benchmarks in four languages, plus ablations and cross-lingual analyses that demonstrate robust cross-lingual consistency and transfer to vision tasks. These results suggest that architectural innovations can significantly reduce data and compute requirements while delivering high-quality multilingual capabilities, with clear paths for future multimodal extension and larger-scale models.

Abstract

We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours ($148K) for full training. Comprehensive evaluations across 27 benchmarks in four languages demonstrate Trillion-7B's robust multilingual performance and exceptional cross-lingual consistency.

Paper Structure

This paper contains 50 sections, 4 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Trillion-7B significantly advances the Pareto-frontier across all aspects.
  • Figure 2: Cross-Lingual Document Attention. A multilingual batch (left) is packed so that each sequence contains contiguous spans from at least two languages (e.g. English + Korean). The XLDA mask (centre) keeps full self-attention across language blocks (blue cells), while the standard causal mask (right) blocks attention across document boundaries (grey cells).
  • Figure 3: Discrepancy in scaling curves of Llama. The plots suggest that brute-force scaling (by Llama 2 & 3) results in huge performance gaps between English and Korean, whereas Trillion-7B shows more desirable scaling behavior for Korean performance, closing the wide gap.
  • Figure 4: Proxy model and emergence point. We trained 1.8B parameter models on approximately 100 billion tokens to serve as proxy models for determining optimal training configurations. This specific configuration, represented by a red star in the figure, identifies the most FLOP-efficient setting at which downstream task improvements become observable.
  • Figure 5: Average Korean throughput measured on 1,000 selected Korean documents using vLLM. We choose a Korean vocabulary size of 24,552 tokens, surpassing the scaling-law optimal size of around 13,000 tokens, yet still positioned just before the plateau in inference speed gains for Korean. This decision strategically balances theoretical optimality against practical improvements in inference speed.
  • ...and 5 more figures
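
The masking contrast described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy `doc_ids` layout, and the assumption that attention stays causal (lower-triangular) within a packed sequence are all illustrative choices, not details taken from the report.

```python
import numpy as np


def document_blocked_causal_mask(doc_ids):
    """Causal mask that also blocks attention across document boundaries.

    doc_ids[i] is the (hypothetical) document index of token i in a packed
    sequence. Entry [q, k] is True when query q may attend to key k.
    """
    doc_ids = np.asarray(doc_ids)
    n = doc_ids.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))        # k <= q
    same_doc = doc_ids[:, None] == doc_ids[None, :]      # q and k in one document
    return causal & same_doc


def cross_document_causal_mask(doc_ids):
    """Causal mask that keeps attention across document (language) blocks.

    Assumption: cross-block attention is still causal, so a later Korean
    span can attend to an earlier English span, but not the reverse.
    """
    n = len(doc_ids)
    return np.tril(np.ones((n, n), dtype=bool))


# Toy packed sequence: 3 tokens of an English document (id 0)
# followed by 2 tokens of a Korean document (id 1).
doc_ids = [0, 0, 0, 1, 1]
blocked = document_blocked_causal_mask(doc_ids)
cross = cross_document_causal_mask(doc_ids)

# Cells that only the cross-document mask allows: Korean queries
# attending to English keys.
extra = cross & ~blocked
```

In this toy layout, `extra` marks exactly the Korean-to-English cells, which is where cross-lingual transfer through attention would take place under the stated assumptions.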