Conda: Column-Normalized Adam for Training Large Language Models Faster

Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin

TL;DR

Conda targets the spectral inefficiencies of Adam-based updates in large transformers by integrating column-wise spectral normalization with Adam-like adaptivity. It does so by performing an SVD-based subspace projection on the first moment, applying column-specific second-moment normalization to projected gradients, and updating parameters with a mild, subspace-aware normalization. Empirical results on LLaMA and GPT-2 demonstrate 2–2.5× faster convergence in both steps and time, with robust improvements in perplexity and downstream accuracy across model scales and fine-tuning tasks. This approach offers a practical, scalable optimizer for efficient large-scale LLM training and opens avenues for further exploration of subspace-aligned second-moment estimators.

Abstract

Large language models (LLMs) have demonstrated impressive generalization and emergent capabilities, yet their pre-training remains computationally expensive and sensitive to optimization dynamics. While Adam-based optimizers offer fast convergence by adapting learning rates coordinate-wise, recent studies reveal that their updates often suffer from poor spectral conditioning and low-rank structure, which hinders efficiency. Muon addresses this issue via global spectral normalization but lacks the per-coordinate adaptivity of Adam. In this work, we propose Column-Normalized Adam (Conda), a novel optimizer that bridges the strengths of both approaches. Conda projects updates into an orthogonal subspace and applies column-wise second-moment normalization based on the projected gradients, thereby improving spectral conditioning while maintaining coordinate-wise adaptivity. This design alleviates the spectral pathologies of Adam while preserving its fast convergence behavior. Extensive experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training. Remarkably, on the LLaMA series, Conda achieves 2–2.5× the convergence speed of AdamW, measured in both training steps and training time. Further ablations demonstrate its robustness under diverse training setups. These results collectively highlight Conda as an effective and broadly applicable optimizer for large-scale LLM training. The code is released at https://github.com/jie040109/Conda
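
The pipeline described in the abstract (first moment, SVD-based projection, second-moment normalization in the projected space, update mapped back) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation. The names `init_state` and `conda_step`, the hyperparameter defaults, and the omission of bias correction and weight decay are choices made for this example. In particular, the granularity of the second-moment statistic (one running average per entry of the projected gradient) is an assumption, since the abstract only describes it as column-wise. The refresh interval `T` stands for the subspace update frequency mentioned in Figure 4(d).

```python
import numpy as np

def init_state(shape):
    """Optimizer state for one 2D weight matrix of the given (m, n) shape."""
    m, n = shape
    k = min(m, n)  # rank of the thin SVD basis
    return {"t": 0, "M": np.zeros((m, n)), "V": np.zeros((k, n)), "U": None}

def conda_step(W, G, state, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8, T=1):
    """One illustrative Conda-style step (assumed form, see lead-in)."""
    state["t"] += 1
    # Adam-style first moment of the raw gradient.
    state["M"] = beta1 * state["M"] + (1.0 - beta1) * G
    M = state["M"]
    # Refresh the orthogonal basis from the SVD of the first moment every T steps.
    if state["U"] is None or (state["t"] - 1) % T == 0:
        U, _, _ = np.linalg.svd(M, full_matrices=False)
        state["U"] = U
    U = state["U"]
    # Project both the first moment and the current gradient into the subspace.
    M_proj = U.T @ M
    G_proj = U.T @ G
    # Second moment accumulated on the projected gradient (assumed granularity).
    state["V"] = beta2 * state["V"] + (1.0 - beta2) * G_proj**2
    # Normalize in the subspace, then map the update back to parameter space.
    update = U @ (M_proj / (np.sqrt(state["V"]) + eps))
    return W - lr * update

# Toy usage: one step on the quadratic loss 0.5 * ||W - W_star||_F^2.
rng = np.random.default_rng(0)
W_star = rng.standard_normal((4, 6))
W = np.zeros((4, 6))
state = init_state(W.shape)
W_next = conda_step(W, W - W_star, state, lr=1e-2)
```

Normalizing in the projected space rather than the original coordinates is what distinguishes this sketch from plain Adam; with `U` fixed to the identity it reduces to an Adam-like step without bias correction.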

Paper Structure

This paper contains 18 sections, 4 theorems, 18 equations, 7 figures, 15 tables, 1 algorithm.

Key Result

Lemma 1

The Muon update in Eqn. muon can be reformulated into an equivalent form in which $\mathrm{diag}(\mathbf{\Sigma}_{t})$ maps the singular values into a vector in $\mathbb{R}^{\min(m,n)}$ and $\mathbf{1}\in\mathbb{R}^{n}$ denotes the all-ones vector.
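
The lemma's displayed equation is not reproduced in this text, so the check below is only a sketch of one identity consistent with the symbols it names. It assumes Muon's update direction is $\mathbf{U}\mathbf{V}^{\top}$ from the thin SVD $\mathbf{M}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}$ of the momentum matrix, and that the reformulation divides the projected momentum $\mathbf{U}^{\top}\mathbf{M}$ element-wise by $\mathrm{diag}(\mathbf{\Sigma})\mathbf{1}^{\top}$. The function names are hypothetical, and an exact SVD is used here for clarity, whereas practical Muon implementations typically approximate the orthogonalization with a Newton-Schulz iteration.

```python
import numpy as np

def muon_direction(M):
    """Orthogonalized direction U V^T from the thin SVD of M."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def reformulated_direction(M):
    """Same direction written as a normalization in the projected space (assumed form)."""
    U, sigma, _ = np.linalg.svd(M, full_matrices=False)
    n = M.shape[1]
    # diag(Sigma) 1^T: every row i is filled with the singular value sigma_i.
    N = np.outer(sigma, np.ones(n))
    # U^T M equals Sigma V^T, so dividing row i by sigma_i recovers V^T.
    return U @ ((U.T @ M) / N)

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))  # stand-in for a momentum matrix
same = np.allclose(muon_direction(M), reformulated_direction(M))
```

The identity requires all singular values to be nonzero; seen this way, Muon divides each row of the projected momentum by a single shared scale, which is the kind of normalizer an adaptive second-moment estimate could replace.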

Figures (7)

  • Figure 1: Spectral analysis of optimizer updates on LLaMA-60M (a, b) and LLaMA-350M (c, d). (a, c): Natural-log ($\log_e$) condition number of 2D update matrices over training steps. (b, d): Distribution of the $\log_{10}$ of all singular values of 2D update matrices at the end of training.
  • Figure 2: Validation loss curves for LLaMA models over training steps (top) and time (bottom).
  • Figure 3: Validation loss curves for GPT-2 models over training steps and time.
  • Figure 4: (a) Zero-shot average accuracy on downstream tasks plotted against training steps. (b) Same as (a), but plotted against training time. (c) Validation loss curve on LLaMA-1B with sequence length 1024. (d) Perplexity ($\downarrow$) under different subspace update frequencies $T$.
  • Figure 5: Perplexity ($\downarrow$) comparison with and without subspace projection in Conda.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Lemma 1
  • Lemma 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof