Compass-V2 Technical Report

Sophia Maria

TL;DR

Compass-v2 addresses the underrepresentation of Southeast Asian (SEA) languages in large language models by training from scratch on a SEA- and e-commerce-oriented corpus, using a fine-grained MoE architecture with 30B total and 5B active parameters. A single hybrid reasoning model supports both fast thinking and deep thinking; it is trained through multi-stage pre-training (including long-context extension) and a two-stage supervised fine-tuning regime combined with direct preference optimization. The model is complemented by targeted data pipelines for multilingual and e-commerce instructions, a SEA-optimized tokenizer, and quantization strategies (FP8 and AWQ) that enable efficient real-world deployment via Shopee CAP. Empirical results show performance competitive with larger models on SEA multilingual and e-commerce benchmarks, and in-house results close to those of top-tier models, with a favorable efficiency-to-performance trade-off. The work demonstrates the practicality of sparsely activated architectures for region-specific NLP applications and lays the groundwork for broader multilingual and enterprise-scale deployments in SEA markets.

Abstract

Predominant LLMs focus on high-resource languages and leave low-resource languages, particularly those of Southeast Asia (SEA), underrepresented. These models are also general-purpose and pay limited attention to the e-commerce domain. To overcome these limitations, we introduce Compass-v2, a lightweight Mixture-of-Experts (MoE) model designed specifically for Southeast Asian languages and e-commerce applications. To balance model performance and inference cost, the model has 30B total parameters and 5B active parameters and incorporates both fine-grained and shared expert modules. To enhance multilingual performance, we curated and constructed a high-quality SEA dataset that is, to the best of our knowledge, industry-leading. To boost performance in the e-commerce domain, we built a dataset comprising hundreds of billions of tokens, sourced through external data mining and internal platform collection. In addition, we pioneered a hybrid reasoning model that supports both fast thinking and deep thinking within a unified framework, diverging from the conventional industry practice of deploying two separate models. In extensive experimental evaluations, our model demonstrates state-of-the-art SEA multilingual and e-commerce performance among sub-30B models while maintaining significantly lower inference cost.
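
The gap between total and active parameters follows from sparse routing: each token passes through every shared expert but only a few of the routed, fine-grained experts. The sketch below illustrates that idea in plain Python. It is not the Compass-v2 implementation; the names MoELayerConfig, route and layer_param_counts, the top-k softmax router, and every number in the usage note are assumptions made for illustration, since the abstract gives only the 30B/5B totals.

```python
import math
from dataclasses import dataclass


@dataclass
class MoELayerConfig:
    # Hypothetical fields; none of these values come from the report.
    n_routed_experts: int   # fine-grained experts the router can choose from
    n_shared_experts: int   # experts applied to every token
    top_k: int              # routed experts activated per token
    params_per_expert: int  # parameters in one expert feed-forward block


def route(router_logits, top_k):
    """Pick the top_k routed experts for one token.

    Returns (expert_index, weight) pairs. Weights are a softmax over the
    selected logits only, so they sum to 1. This is one common gating
    choice, assumed here rather than taken from the report.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def layer_param_counts(cfg):
    """Return (total, active) expert parameters for one MoE layer.

    Router, attention and embedding parameters are ignored for simplicity.
    """
    total = (cfg.n_routed_experts + cfg.n_shared_experts) * cfg.params_per_expert
    active = (cfg.top_k + cfg.n_shared_experts) * cfg.params_per_expert
    return total, active
```

With a hypothetical layer of 64 routed experts, 2 shared experts and top_k = 6, only 8 of the 66 expert blocks run for any given token, which is how a model can hold far more parameters than it uses per forward pass.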

Paper Structure

This paper contains 46 sections, 2 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 3: Industry Overview of Pretraining Dataset Scale.
  • Figure 4: Multilingual Corpus Construction.
  • Figure 5: Pretraining Data Processing Overview.
  • Figure 6: Compression ratio of the Compass-v2 tokenizer on different languages. The Compass-v2 tokenizer achieves the optimal compression rate for SEA languages.
  • Figure 7: Data volume of the Compass-v2 dataset versus previous generations of Compass training data and leading models.
  • ...and 9 more figures