Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, Anna Golubeva, Vasu Shyam, Robert Washbourne, Rishi Iyer, Ansh Chaurasia, Tomas Figliolia, Xiao Yang, Abhinav Sarje, Drew Thorstensen, Amartey Pearson, Zack Grossbart, Jason van Patten, Emad Barsoum, Zhenyu Gu, Yao Fu, Beren Millidge

TL;DR

The paper investigates large-scale MoE pretraining on an AMD-based platform (MI300X GPUs with Pollara networking), presenting a production-scale case study that covers system benchmarking, transformer sizing, and architecture design. It introduces ZAYA1-base, an MoE model with 760M active parameters that uses Compressed Convolutional Attention, a more expressive ZAYA1 router, and residual scaling, and is trained in three phases with context extension up to 32k. The authors provide detailed hardware characterization, kernel and optimizer engineering (including Muon), and a fault-tolerance framework, and report performance competitive with larger dense models and prior MoEs. Overall, the work establishes that the AMD hardware stack is mature for frontier-scale pretraining and offers practical guidance for future AMD-centric LLM development.

Abstract

We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing both MI300X GPUs and Pollara networking. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts over Pollara. To our knowledge, this is the first such characterization at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault tolerance and checkpoint reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model, ZAYA1 (an MoE with 760M active and 8.3B total parameters, available at https://huggingface.co/Zyphra/ZAYA1-base), which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.
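
As a rough illustration of how collective microbenchmarks of the kind mentioned above are commonly summarized, the sketch below converts a measured time into algorithm bandwidth and bus bandwidth. The function name `collective_bandwidth` and the table `BUS_FACTORS` are hypothetical, and the correction factors follow the widely used nccl-tests convention; this summary does not say which convention the paper adopts, so treat the convention as an assumption.

```python
# Hypothetical helper, not taken from the paper.
# Bus-bandwidth correction factors follow the nccl-tests convention (assumed).
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
}


def collective_bandwidth(op, message_bytes, seconds, n_ranks):
    """Return (algbw, busbw) in GB/s for one timed collective call.

    message_bytes is the total data size of the operation, seconds is the
    measured wall-clock time, and n_ranks is the number of participating GPUs.
    """
    if op not in BUS_FACTORS:
        raise ValueError(f"unknown collective: {op}")
    if seconds <= 0 or n_ranks < 2:
        raise ValueError("need a positive time and at least two ranks")
    # Algorithm bandwidth: bytes moved from the caller's point of view.
    algbw = message_bytes / seconds / 1e9
    # Bus bandwidth: rescaled so results are comparable across rank counts.
    busbw = algbw * BUS_FACTORS[op](n_ranks)
    return algbw, busbw


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the paper.
    algbw, busbw = collective_bandwidth("all_reduce", 1e9, 0.01, 8)
    print(f"algbw={algbw:.1f} GB/s busbw={busbw:.1f} GB/s")
```

Reporting bus bandwidth rather than raw time makes curves for different GPU counts directly comparable against the link's peak rate, which is why sweeps across message sizes and rank counts are usually plotted this way.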

Paper Structure

This paper contains 38 sections, 25 equations, 41 figures, 5 tables, 1 algorithm.

Figures (41)

  • Figure 1: The AMD software stack used to train ZAYA1, along with the languages and component libraries that each layer is written in. The principal hardware is described in the cluster setup section (sec:cluster-setup). Our core training framework is a forked internal version of Megatron-LM adapted for the AMD stack.
  • Figure 3: The achievable memory bandwidth to HBM for PyTorch using the ROCm/CUDA backends.
  • Figure 8: The model architecture of ZAYA1. The two core innovations in architecture presented here are CCA for the attention block and the ZAYA1 router. The ZAYA1 router replaces the linear router with a more expressive one consisting of downprojection, EDA, and then three sequential MLPs per expert.
  • Figure 9: Schematic of the three phases of pretraining for ZAYA1-base. Data mixture, learning rate schedule, and context length are chosen for each phase so that the model is prepared for post-training. The core pretraining consists of two phases. The first phase inculcates general knowledge and linguistic understanding into the model through highly diverse corpora of primarily web-sourced data. The second phase begins to reinforce and strengthen the mathematics, coding, and STEM knowledge components through additional mixing of information-rich high-quality data. The final phase extends the context and further emphasizes STEM content as well as prepares the base for instruction-following and reasoning post-training.
  • Figure 11: The context-parallelism design used to train ZAYA1-base on longer context lengths.
  • ...and 36 more figures