Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers

Liangyu Wang; Siqi Zhang; Junjie Wang; Yiming Dong; Bo Zheng; Zihan Qiu; Shengkun Tang; Di Wang; Rui Men; Dayiheng Liu

TL;DR

This work tackles the fundamental conflict between matrix-based optimizers that require holistic access to weight matrices and distributed training stacks that shard parameters across DP and TP. Canzona decouples logical optimizer ownership from parameter distribution, introducing an alpha-Balanced Static Partitioning for DP and an Asynchronous Micro-Group Pipeline for TP to preserve atomicity and ZeRO geometry. Offline planning algorithms (alpha-Balanced Greedy LPT for DP and Micro-Group scheduling with greedy rollback for TP) produce load-balanced partition plans that enable zero-communication during optimizer steps and efficient interconnect usage. Experiments on Qwen3 models up to 32B parameters across 256 GPUs show up to 1.57x end-to-end speedups and 5.8x optimizer latency reductions, with convergence preserved, and demonstrate generality to Muon, Shampoo, and SOAP.
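As a rough illustration of the planning idea named above, the following is a minimal sketch of plain Greedy LPT (Longest Processing Time first): whole parameters are sorted by estimated cost and each is assigned to the currently least-loaded rank. It deliberately omits the alpha term, which this summary does not define, so it should not be read as the paper's algorithm. The function name `greedy_lpt`, the parameter names, and the cost values are hypothetical.

```python
import heapq


def greedy_lpt(costs, num_ranks):
    """Plain Greedy LPT: assign whole parameters to ranks, largest cost first.

    costs: dict mapping a parameter name to an estimated optimizer cost
           (hypothetical values; a real planner would profile or model them).
    Returns (assignment, loads), both keyed by rank index.
    """
    # Min-heap of (current load, rank) so the least-loaded rank pops first.
    heap = [(0.0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(num_ranks)}

    # Sort by descending cost; ties are broken by name for determinism.
    for name, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(name)  # the parameter is never split
        heapq.heappush(heap, (load + cost, rank))

    loads = {rank: sum(costs[n] for n in names)
             for rank, names in assignment.items()}
    return assignment, loads


# Hypothetical per-parameter costs for a tiny model.
costs = {
    "attn.qkv": 9.0, "attn.out": 3.0,
    "mlp.up": 8.0, "mlp.down": 8.0,
    "embed": 2.0, "head": 2.0,
}
assignment, loads = greedy_lpt(costs, num_ranks=2)
print(assignment)
print(loads)
```

Because the plan depends only on parameter shapes and costs, it can be computed once offline and reused at every step, which is consistent with the static-layout framing above.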

Abstract

The scaling of Large Language Models (LLMs) drives interest in matrix-based optimizers (e.g., Shampoo, Muon, SOAP) for their convergence efficiency; yet their requirement for holistic updates conflicts with the tensor fragmentation in distributed frameworks like Megatron. Existing solutions are suboptimal: synchronous approaches suffer from computational redundancy, while layer-wise partitioning fails to reconcile this conflict without violating the geometric constraints of efficient communication primitives. To bridge this gap, we propose Canzona, a Unified, Asynchronous, and Load-Balanced framework that decouples logical optimizer assignment from physical parameter distribution. For Data Parallelism, we introduce an alpha-Balanced Static Partitioning strategy that respects atomicity while neutralizing the load imbalance. For Tensor Parallelism, we design an Asynchronous Compute pipeline utilizing Micro-Group Scheduling to batch fragmented updates and hide reconstruction overhead. Extensive evaluations on the Qwen3 model family (up to 32B parameters) on 256 GPUs demonstrate that our approach preserves the efficiency of established parallel architectures, achieving a 1.57x speedup in end-to-end iteration time and reducing optimizer step latency by 5.8x compared to the baseline.

Paper Structure

This paper contains 58 sections, 7 equations, 16 figures, 1 table, 4 algorithms.

Introduction
Preliminary
Load-Balanced Asynchronous Compute for Data Parallelism Matrix-based Optimizer
Design Paradigm Analysis: Why Static Layout?
Load-Balance Optimization
System Workflow: Static-Layout Enforcement
Tensor Parallelism with Load-Balanced Asynchronous Compute
Task Abstraction: The Asynchronous Compute Unit
Workload Scheduling: Hierarchical Partitioning
Canzona: Framework for Unified, Asynchronous, and Load-Balanced Matrix-based Optimizer
Experiment
Experiment Setup
Main Results
Precision Verification
Related Work
...and 43 more sections

Figures (16)

Figure 1: Comparison of Data Parallelism (DP) Partitioning Strategies. (Left) DP-SC: Matrix-based optimizers can run directly under DDP, which replicates optimizer states on all ranks; every rank then performs identical matrix-based operations synchronously, resulting in Redundant Compute. (Right) DP-ASC: Eliminates redundancy by partitioning states. Equal Chunk (Standard ZeRO-1, Gray Arrow and Line): Standard partitioning (e.g., for AdamW) slices the buffer into uniform shards ($|B|/R$). This arbitrary slicing (dashed lines) violates the atomicity required by matrix-based optimizers. Ours (Static Partitioning, Orange / Green Arrow): We enforce parameter atomicity by respecting tensor boundaries. Load Imbalance (Orange Arrow, Line, and Box): A naive atomic assignment leads to significant computational stragglers and communication bubbles (dashed box) due to varying parameter costs. Load Balance (Green Arrow, Line, and Box): Our $\alpha$-Balanced algorithm optimizes the static layout, redistributing whole parameters to equalize the workload across ranks. Note: In this figure, the blocks labeled $P$ represent the optimizer states and the associated update computation for those parameters. The parameters themselves remain replicated across ranks during the forward and backward passes (following the ZeRO-1 protocol).
Figure 2: Optimizer Update Workflow Comparison of Tensor Parallelism (TP) Strategies. (Left) TP-SC: An intuitive approach that relies on synchronous collective communication (All-Gather) and redundant computation (performing the same tensor operations on all TP ranks), which limits scalability and efficiency. (Right) TP-ASC: Our proposed strategy utilizing Micro-Group Scheduling. Micro Gradient Group: Gradients are aggregated into micro groups (labeled $G$ in the figure) to saturate the All-to-All communication bandwidth (dashed box), replacing inefficient small-kernel calls. Load Balancing: Instead of fixed assignments, these groups are dynamically scheduled to Host Ranks. The distinct block lengths in the "Compute" phase illustrate our algorithm's ability to handle varying computational costs, minimizing the overall execution makespan.
Figure 3: Main Results. (a) Efficiency Comparison: Our LB-ASC strategy outperforms baselines by effectively eliminating computational bubbles and maximizing device utilization. (b) & (c) Load Balancing Analysis: The visualized load distributions demonstrate that our proposed scheduling algorithms successfully flatten the workload variance for both Tensor Parallelism (b) and Data Parallelism (c), significantly mitigating the straggler problem compared to naive partitioning.
Figure 4: End-to-End Iteration Time Comparison: Our framework significantly outperforms layerwise_optimizer. The performance advantage is driven by two factors: (1) the elimination of runtime communication during the optimizer step via our decoupled design, and (2) the preservation of the ZeRO-1 Geometric Constraint mentioned in Appendix \ref{subsec:layerwise-geometric-conflict}.
Figure 5: Precision Verification: SC Baseline & LB-ASC (Ours)
...and 11 more figures
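
The atomicity problem described in the Figure 1 caption can be checked with a small calculation. The sketch below lays hypothetical tensors end to end in a flat buffer, cuts the buffer into uniform shards of size $|B|/R$, and reports which tensors straddle a shard boundary. The function name `tensors_split_by_equal_chunks` and the tensor names and sizes are assumptions made for illustration only.

```python
def tensors_split_by_equal_chunks(sizes, num_ranks):
    """Return the tensors that an equal-chunk (|B|/R) slicing would cut apart.

    sizes: dict mapping a tensor name to its flattened element count, in the
           order the tensors are laid out in the flat buffer (hypothetical).
    """
    total = sum(sizes.values())
    if total % num_ranks != 0:
        raise ValueError("sketch assumes the buffer divides evenly")
    shard = total // num_ranks

    split = []
    offset = 0
    for name, size in sizes.items():
        first_shard = offset // shard              # shard holding the first element
        last_shard = (offset + size - 1) // shard  # shard holding the last element
        if first_shard != last_shard:
            split.append(name)  # this tensor would live on more than one rank
        offset += size
    return split


# Hypothetical flat buffer of 16 elements.
sizes = {"w_q": 6, "w_k": 6, "w_v": 4}
print(tensors_split_by_equal_chunks(sizes, num_ranks=2))  # ['w_k']
print(tensors_split_by_equal_chunks(sizes, num_ranks=4))  # ['w_q', 'w_k']
```

Any tensor in the returned list would need to be reassembled before a matrix-based update could run on it, which is the situation a partitioning that respects tensor boundaries avoids.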
