Arcee Trinity Large Technical Report

Varun Singh; Lucas Krauss; Sami Jaghouar; Matej Sirovatka; Charles Goddard; Fares Obied; Jack Min Ong; Jannik Straube; Fern; Aria Harley; Conner Stewart; Colin Kealty; Maziyar Panahi; Simon Kirsten; Anushka Deshpande; Anneketh Vij; Arthur Bresnu; Pranav Veldurthi; Raghav Ravishankar; Hardik Bishnoi; DatologyAI Team; Arcee AI Team; Prime Intellect Team; Mark McQuade; Johannes Hagemann; Lucas Atkins

TL;DR

Trinity Large is a flagship open-weight sparse Mixture-of-Experts language model with $400\mathrm{B}$ total parameters and $13\mathrm{B}$ active per token, accompanied by Trinity Nano and Trinity Mini as smaller steps on the scaling ladder. The architecture combines interleaved local/global attention, gated attention, depth-scaled sandwich normalization, and sigmoid MoE routing, with a new Soft-clamped Momentum Expert Bias Updates (SMEBU) load-balancing strategy and the Muon optimizer for training stability. The report details tokenizer design, MoE routing, normalization, and long-context extension, and presents capability and inference benchmarks showing competitive performance under FP8 quantization, along with a discussion of training stability and post-training plans. The work demonstrates that large open-weight MoE models with advanced load balancing and long-context capabilities are practical to deploy for enterprise and research applications.

Abstract

We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. Additionally, we report on Trinity Nano and Trinity Mini: Trinity Nano has 6B total parameters with 1B activated per token, and Trinity Mini has 26B total parameters with 3B activated per token. The models' modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for Mixture-of-Experts. For Trinity Large, we also introduce a new MoE load balancing strategy called Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at https://huggingface.co/arcee-ai.
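To make the sparsity of the three models concrete, the short sketch below computes the fraction of parameters activated per token from the counts quoted in the abstract. The dictionary and the helper `active_fraction` are illustrative names, not part of the report, and the inputs are the rounded figures given above, so the resulting ratios are approximate.

```python
# Parameter counts in billions (total, active per token), as quoted in the abstract.
# These are rounded figures, so the ratios below are approximate.
MODELS = {
    "Trinity Nano": (6, 1),
    "Trinity Mini": (26, 3),
    "Trinity Large": (400, 13),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the fraction of parameters activated per token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    frac = active_fraction(total_b, active_b)
    print(f"{name}: {frac:.2%} of parameters active per token")
```

On these numbers, Trinity Large activates roughly 3% of its parameters per token, compared with roughly 17% for Trinity Nano and 12% for Trinity Mini, so the largest model is also the sparsest of the three.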

Paper Structure (24 sections, 21 equations, 4 figures, 4 tables)

Introduction
Architecture
Tokenizer
Pretokenization
Vocabulary Size
Tokenizer Efficiency
Attention
Mixture-of-Experts
Normalization
Initialization
Pre-training
Data
Data Preparation
Infrastructure
Hyperparameters
...and 9 more sections

Figures (4)

Figure 1: The training loss graph for Trinity Large, with no sub-sampling or smoothing. For clarity, we indicate where the batch size was increased to 128M (134,217,728) tokens, as well as the points where we switch data mixtures.
Figure 2: The architecture of the Trinity model family. $^*$ RoPE is only present in local layers. $^\dagger$ The grouped-query attention has a sliding window for the local layers.
Figure 3: A comparison of Trinity Large Base to other similar open-weight base models.
Figure 4: Throughput comparison of models. All tests were done with models quantized to FP8, using vLLM, on 8xH200.
