Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers
Liangyu Wang, Siqi Zhang, Junjie Wang, Yiming Dong, Bo Zheng, Zihan Qiu, Shengkun Tang, Di Wang, Rui Men, Dayiheng Liu
TL;DR
This work tackles the fundamental conflict between matrix-based optimizers, which require holistic access to each weight matrix, and distributed training stacks, which shard parameters across Data Parallelism (DP) and Tensor Parallelism (TP). Canzona decouples logical optimizer ownership from physical parameter distribution, introducing alpha-Balanced Static Partitioning for DP and an Asynchronous Micro-Group Pipeline for TP to preserve update atomicity and ZeRO geometry. Offline planning algorithms (alpha-Balanced Greedy LPT for DP and Micro-Group scheduling with greedy rollback for TP) produce load-balanced partition plans that require zero communication during optimizer steps and use the interconnect efficiently. Experiments on Qwen3 models of up to 32B parameters across 256 GPUs show up to 1.57x end-to-end speedup and a 5.8x reduction in optimizer step latency, with convergence preserved, and demonstrate generality across Muon, Shampoo, and SOAP.
Abstract
The scaling of Large Language Models (LLMs) drives interest in matrix-based optimizers (e.g., Shampoo, Muon, SOAP) for their convergence efficiency; yet their requirement for holistic updates conflicts with the tensor fragmentation in distributed frameworks like Megatron. Existing solutions are suboptimal: synchronous approaches suffer from computational redundancy, while layer-wise partitioning cannot reconcile this conflict without violating the geometric constraints of efficient communication primitives. To bridge this gap, we propose Canzona, a Unified, Asynchronous, and Load-Balanced framework that decouples logical optimizer assignment from physical parameter distribution. For Data Parallelism, we introduce an alpha-Balanced Static Partitioning strategy that respects atomicity while neutralizing load imbalance. For Tensor Parallelism, we design an Asynchronous Compute pipeline that uses Micro-Group Scheduling to batch fragmented updates and hide reconstruction overhead. Extensive evaluations on the Qwen3 model family (up to 32B parameters) on 256 GPUs demonstrate that our approach preserves the efficiency of established parallel architectures, achieving a 1.57x speedup in end-to-end iteration time and reducing optimizer step latency by 5.8x compared to the baseline.
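To make the idea of atomic, load-balanced optimizer assignment concrete, here is a minimal sketch of classic greedy LPT (Longest Processing Time first) scheduling, in which each weight matrix is assigned whole to exactly one data-parallel rank. This is the textbook heuristic only, not the paper's alpha-Balanced variant, whose balancing rule is not specified here. The function name `greedy_lpt`, the per-matrix cost values, and the layer names are all hypothetical and chosen for illustration.

```python
import heapq


def greedy_lpt(costs, num_ranks):
    """Classic greedy LPT: assign each item whole to the least-loaded rank.

    costs: dict mapping a (hypothetical) matrix name to an estimated
           optimizer-step cost; num_ranks: number of data-parallel ranks.
    Returns (assignment, loads), both keyed by rank index.
    """
    # Min-heap of (current_load, rank) so the least-loaded rank pops first.
    heap = [(0.0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(num_ranks)}

    # Visit items in decreasing cost order; ties broken by name for determinism.
    for name, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(name)  # the matrix is never split across ranks
        heapq.heappush(heap, (load + cost, rank))

    loads = {r: sum(costs[n] for n in names) for r, names in assignment.items()}
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical per-matrix costs (arbitrary units), not measured values.
    example_costs = {"attn.qkv": 8, "mlp.up": 7, "mlp.down": 6, "attn.out": 5, "embed": 4}
    plan, rank_loads = greedy_lpt(example_costs, num_ranks=2)
    print(plan)
    print(rank_loads)
```

Because the plan depends only on static per-matrix costs, a scheme like this can be computed once offline, which is consistent with the offline planning described above; how Canzona weighs its costs and applies the alpha parameter is not covered by this sketch.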
