Memory Analysis on the Training Course of DeepSeek Models

Ping Zhang, Lei Su

TL;DR

This work analyzes device-level GPU memory demands for training large DeepSeek models under distributed configurations, clarifying how micro-batch size, activation recomputation, 3D parallelism, and ZeRO optimizations affect memory usage. It characterizes memory along four lines: model architecture (the 61-layer DeepSeek-v3), parameter counting (including MoE-dominated components), static-parameter memory under stage-wise pipeline parallelism, and activation memory with and without recomputation. The key findings are that MoE components dominate memory; that static per-device parameter memory is about 11.64 GB under the reference setup; that pipeline parallelism carries substantial per-stage memory (notably 86 GB per stage for Stages 1–14); and that ZeRO strategies and activation recomputation yield meaningful memory reductions at the cost of extra compute and complexity. The analysis also accounts for practical overheads such as memory fragmentation and inter-GPU buffers, informing memory-efficient configurations for training large MoE models.

Abstract

We present a theoretical analysis of GPU memory consumption during the training of DeepSeek models such as DeepSeek-v2 and DeepSeek-v3. Our primary objective is to clarify the device-level memory requirements associated with various distributed training configurations. Specifically, we examine critical factors influencing memory usage, including micro-batch size, activation recomputation policies, 3D parallelism, and ZeRO optimizations. We emphasize that the training policies discussed in this report are not representative of DeepSeek's official configurations. Instead, they are explored to provide a deeper understanding of memory dynamics in the training of large-scale mixture-of-experts models.
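
To make the role of ZeRO in the factors listed above concrete, the following is a minimal sketch of how static (non-activation) memory per device could be estimated as each ZeRO stage shards one more component across data-parallel ranks. It is not taken from the paper: the function name `zero_static_bytes`, the `StaticMemory` container, and the default byte counts (2 bytes per parameter and per gradient, 12 bytes per parameter of optimizer state, the common mixed-precision Adam accounting) are assumptions chosen for illustration, and the paper's own reference setup may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticMemory:
    """Per-device static memory in bytes (hypothetical container)."""
    params: float
    grads: float
    optimizer: float

    @property
    def total(self) -> float:
        return self.params + self.grads + self.optimizer


def zero_static_bytes(
    num_params: int,
    dp_degree: int,
    zero_stage: int,
    param_bytes: int = 2,    # assumption: 16-bit weights
    grad_bytes: int = 2,     # assumption: 16-bit gradients
    optim_bytes: int = 12,   # assumption: FP32 master copy + two Adam moments
) -> StaticMemory:
    """Estimate per-device static memory for one model shard.

    num_params is the parameter count held by this device after any
    tensor/pipeline/expert partitioning; dp_degree is the number of
    data-parallel ranks that ZeRO shards across.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    if dp_degree < 1:
        raise ValueError("dp_degree must be >= 1")

    params = float(num_params * param_bytes)
    grads = float(num_params * grad_bytes)
    optimizer = float(num_params * optim_bytes)

    # Each stage shards one more component across the data-parallel group.
    if zero_stage >= 1:
        optimizer /= dp_degree
    if zero_stage >= 2:
        grads /= dp_degree
    if zero_stage >= 3:
        params /= dp_degree
    return StaticMemory(params, grads, optimizer)


if __name__ == "__main__":
    # Hypothetical shard of 1e9 parameters on 8 data-parallel ranks.
    for stage in range(4):
        mem = zero_static_bytes(1_000_000_000, dp_degree=8, zero_stage=stage)
        print(f"ZeRO-{stage}: {mem.total / 1e9:.2f} GB")
```

The example shard size and data-parallel degree are placeholders, not the paper's configuration. Activation memory, fragmentation, and communication buffers are deliberately left out, since they depend on micro-batch size and the recomputation policy rather than on the ZeRO stage.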

Paper Structure

This paper contains 17 sections, 13 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Illustration of the basic architecture of DeepSeek-v3 (Liu et al., 2024)
  • Figure 2: Activation pattern of MLA
  • Figure 3: Activation pattern of MoE linear