Conda: Column-Normalized Adam for Training Large Language Models Faster
Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin
TL;DR
Conda targets the poor spectral conditioning of Adam-based updates in large transformers by combining column-wise normalization in an orthogonal subspace with Adam-like adaptivity. It performs an SVD-based subspace projection on the first moment, applies column-specific second-moment normalization to the projected gradients, and updates the parameters with a mild, subspace-aware normalization. On the LLaMA series, Conda converges 2–2.5× faster than AdamW in both training steps and training time, and experiments on LLaMA and GPT-2 show robust improvements in perplexity and downstream accuracy across model scales and fine-tuning tasks. The approach offers a practical, scalable optimizer for large-scale LLM training and opens avenues for further exploration of subspace-aligned second-moment estimators.
Abstract
Large language models (LLMs) have demonstrated impressive generalization and emergent capabilities, yet their pre-training remains computationally expensive and sensitive to optimization dynamics. While Adam-based optimizers offer fast convergence by adapting learning rates coordinate-wise, recent studies reveal that their updates often suffer from poor spectral conditioning and low-rank structure, which hinders efficiency. Muon addresses this issue via global spectral normalization but lacks the per-coordinate adaptivity of Adam. In this work, we propose Column-Normalized Adam (Conda), a novel optimizer that bridges the strengths of both approaches. Conda projects updates into an orthogonal subspace and applies column-wise second-moment normalization based on the projected gradients, thereby improving spectral conditioning while maintaining coordinate-wise adaptivity. This design alleviates the spectral pathologies of Adam while preserving its fast convergence behavior. Extensive experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training. Remarkably, on the LLaMA series, Conda achieves 2–2.5× the convergence speed of AdamW, measured in both training steps and training time. Further ablations demonstrate its robustness under diverse training setups. These results collectively highlight Conda as an effective and broadly applicable optimizer for large-scale LLM training. The code is available at https://github.com/jie040109/Conda
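To make the update described above concrete, here is a minimal NumPy sketch of one step for a single 2-D weight matrix. It is one plausible reading of the abstract, not the released implementation. The function name `conda_like_step`, the hyperparameter defaults, the use of a full SVD at every step, the exact shape of the second-moment statistic, and the omission of bias correction and weight decay are all assumptions made for illustration.

```python
import numpy as np

def conda_like_step(W, G, M, V, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update for a weight matrix W of shape (m, n).

    Hypothetical sketch: names, defaults, and the exact form of the
    second-moment statistic are assumptions, not the authors' code.
    """
    # Adam-style first moment (momentum) of the raw gradient.
    M = beta1 * M + (1.0 - beta1) * G
    # Orthonormal basis from the SVD of the first moment; U has shape (m, r).
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    # Express the momentum and the gradient in the subspace spanned by U.
    M_proj = U.T @ M
    G_proj = U.T @ G
    # Second moment accumulated from the projected gradient, shape (r, n).
    V = beta2 * V + (1.0 - beta2) * G_proj**2
    # Normalize inside the subspace, then map back to parameter space.
    update = U @ (M_proj / (np.sqrt(V) + eps))
    return W - lr * update, M, V

# Usage on a random 6x4 matrix; r = min(m, n) = 4 with the reduced SVD.
rng = np.random.default_rng(0)
m, n = 6, 4
W = rng.standard_normal((m, n))
G = rng.standard_normal((m, n))
M = np.zeros((m, n))
V = np.zeros((min(m, n), n))
W_new, M, V = conda_like_step(W, G, M, V)
```

Computing an SVD at every step is expensive for large matrices, so a practical implementation would likely amortize or approximate it. The abstract does not say how this is handled, and the sketch leaves it out.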
