Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers

Liangyu Wang, Siqi Zhang, Junjie Wang, Yiming Dong, Bo Zheng, Zihan Qiu, Shengkun Tang, Di Wang, Rui Men, Dayiheng Liu

TL;DR

This work tackles the fundamental conflict between matrix-based optimizers that require holistic access to weight matrices and distributed training stacks that shard parameters across DP and TP. Canzona decouples logical optimizer ownership from parameter distribution, introducing an alpha-Balanced Static Partitioning for DP and an Asynchronous Micro-Group Pipeline for TP to preserve atomicity and ZeRO geometry. Offline planning algorithms (alpha-Balanced Greedy LPT for DP and Micro-Group scheduling with greedy rollback for TP) produce load-balanced partition plans that enable zero-communication during optimizer steps and efficient interconnect usage. Experiments on Qwen3 models up to 32B parameters across 256 GPUs show up to 1.57x end-to-end speedups and 5.8x optimizer latency reductions, with convergence preserved, and demonstrate generality to Muon, Shampoo, and SOAP.

Abstract

The scaling of Large Language Models (LLMs) drives interest in matrix-based optimizers (e.g., Shampoo, Muon, SOAP) for their convergence efficiency; yet their requirement for holistic updates conflicts with the tensor fragmentation in distributed frameworks like Megatron. Existing solutions are suboptimal: synchronous approaches suffer from computational redundancy, while layer-wise partitioning fails to reconcile this conflict without violating the geometric constraints of efficient communication primitives. To bridge this gap, we propose Canzona, a Unified, Asynchronous, and Load-Balanced framework that decouples logical optimizer assignment from physical parameter distribution. For Data Parallelism, we introduce an alpha-Balanced Static Partitioning strategy that respects atomicity while neutralizing the load imbalance. For Tensor Parallelism, we design an Asynchronous Compute pipeline utilizing Micro-Group Scheduling to batch fragmented updates and hide reconstruction overhead. Extensive evaluations on the Qwen3 model family (up to 32B parameters) on 256 GPUs demonstrate that our approach preserves the efficiency of established parallel architectures, achieving a 1.57x speedup in end-to-end iteration time and reducing optimizer step latency by 5.8x compared to the baseline.
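
The load-balancing idea behind the Data Parallelism strategy can be illustrated with a plain Greedy LPT (longest processing time first) assignment. This is a minimal sketch, not the paper's alpha-Balanced algorithm: the summary above does not define the alpha term, so it is left out. The parameter names and per-parameter costs are hypothetical. Each parameter is kept whole (atomic) and handed, heaviest first, to the currently least-loaded rank.

```python
import heapq


def greedy_lpt(costs, num_ranks):
    """Assign whole parameters to ranks: heaviest first, onto the least-loaded rank."""
    # Min-heap of (current load, rank id); ties resolve to the lower rank id.
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(num_ranks)}
    # sorted() is stable, so equal-cost parameters keep their input order.
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(name)  # the parameter is never split
        heapq.heappush(heap, (load + cost, rank))
    loads = {rank: sum(costs[n] for n in names) for rank, names in assignment.items()}
    return assignment, loads


# Hypothetical per-parameter optimizer costs (arbitrary units), for illustration only.
costs = {"wq": 8, "wk": 2, "wv": 2, "wo": 8, "w_up": 12, "w_down": 12, "w_gate": 12}
assignment, loads = greedy_lpt(costs, num_ranks=4)
print(assignment)
print(loads)
```

Because the plan depends only on parameter shapes and costs, it can be computed once offline and reused every step, which is consistent with the static partitioning described above.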

Paper Structure

This paper contains 58 sections, 7 equations, 16 figures, 1 table, 4 algorithms.

Figures (16)

  • Figure 1: Comparison of Data Parallelism (DP) Partitioning Strategies. (Left) DP-SC: The approach these optimizers can use directly is DDP, which replicates optimizer states on all ranks, so every rank synchronously performs identical matrix-based operations (Redundant Compute). (Right) DP-ASC: Eliminates redundancy by partitioning states. Equal Chunk (Standard ZeRO-1, Gray Arrow and Line): Standard partitioning (e.g., for AdamW) slices the buffer into uniform shards ($|B|/R$). This arbitrary slicing (dashed lines) violates the atomicity required by matrix-based optimizers. Ours (Static Partitioning, Orange / Green Arrow): We enforce parameter atomicity by respecting tensor boundaries. Load Imbalance (Orange Arrow, Line, and Box): A naive atomic assignment leads to significant computational stragglers and communication bubbles (dashed box) due to varying parameter costs. Load Balance (Green Arrow, Line, and Box): Our $\alpha$-Balanced algorithm optimizes the static layout, redistributing whole parameters to equalize the workload across ranks. Note: In this figure, the blocks labeled $P$ represent the optimizer states and the associated update computation for those parameters. The parameters themselves remain replicated across ranks during the forward and backward passes (following the ZeRO-1 protocol).
  • Figure 2: Optimizer Update Workflow Comparison of Tensor Parallelism (TP) Strategies. (Left) TP-SC: An intuitive approach that relies on synchronous collective communication (All-Gather) and redundant computation (performing the same tensor operations on all TP ranks), which limits scalability and efficiency. (Right) TP-ASC: Our proposed strategy utilizing Micro-Group Scheduling. Micro Gradient Group: Gradients are aggregated into micro groups (labeled as $G$ in the figure) to saturate the All-to-All (dashed box) communication bandwidth, replacing inefficient small-kernel calls. Load Balancing: Instead of fixed assignments, these groups are dynamically scheduled to Host Ranks. The distinct block lengths in the "Compute" phase illustrate our algorithm's ability to handle varying computational costs, minimizing the overall execution makespan.
  • Figure 3: Main Results: (a) Efficiency Comparison: Our LB-ASC strategy outperforms baselines by effectively eliminating computational bubbles and maximizing device utilization. (b) & (c) Load Balancing Analysis: The visualized load distributions demonstrate that our proposed scheduling algorithms successfully flatten the workload variance for both Tensor Parallelism (b) and Data Parallelism (c), significantly mitigating the straggler problem compared to naive partitioning.
  • Figure 4: End-to-End Iteration Time Comparison: Our framework significantly outperforms layerwise_optimizer. The performance advantage is driven by two factors: (1) the elimination of runtime communication during the optimizer step via our decoupled design, and (2) the preservation of the ZeRO-1 Geometric Constraint discussed in the appendix on the layer-wise geometric conflict.
  • Figure 5: Precision Verification: SC Baseline & LB-ASC (Ours)
  • ...and 11 more figures
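
The atomicity problem described in the Figure 1 caption can be checked with a few lines of code. The sketch below is illustrative only: the tensor names and sizes are hypothetical, and it assumes the flat buffer length $|B|$ divides evenly by the rank count $R$ (real implementations typically pad the buffer). It reports which tensors an equal-chunk split of size $|B|/R$ would cut through.

```python
def tensors_cut_by_equal_chunks(sizes, num_ranks):
    """Return names of tensors that straddle a shard boundary of an equal-chunk split."""
    total = sum(sizes.values())
    if total % num_ranks != 0:
        raise ValueError("sketch assumes the buffer divides evenly across ranks")
    shard = total // num_ranks
    # Interior shard boundaries in the flat buffer: |B|/R, 2|B|/R, ...
    cuts = [shard * r for r in range(1, num_ranks)]
    broken = []
    start = 0
    for name, size in sizes.items():  # tensors are laid out in insertion order
        end = start + size
        # A boundary strictly inside (start, end) splits this tensor across ranks.
        if any(start < cut < end for cut in cuts):
            broken.append(name)
        start = end
    return broken


# Hypothetical flattened buffer: four tensors with 6, 2, 4, and 4 elements.
sizes = {"A": 6, "B": 2, "C": 4, "D": 4}
broken = tensors_cut_by_equal_chunks(sizes, num_ranks=4)
print(broken)  # tensors that no single rank would hold in full
```

Any tensor in the returned list would be fragmented across ranks, which is the situation a matrix-based optimizer cannot tolerate. A static plan that assigns whole tensors avoids it by construction.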