An Embarrassingly Simple Way to Optimize Orthogonal Matrices at Scale

Adrián Javaloy; Antonio Vergari

TL;DR

POGO is a fast, GPU-friendly optimizer that makes orthogonality constraints practical in ML at scale. It greatly outperforms recent optimizers and can optimize problems with thousands of orthogonal matrices in minutes, where alternatives would take hours.

Abstract

Orthogonality constraints are ubiquitous in robust and probabilistic machine learning. Unfortunately, current optimizers are computationally expensive and do not scale to problems with hundreds or thousands of constraints. One notable exception is the Landing algorithm (Ablin et al., 2024), which, however, comes at the expense of temporarily relaxing orthogonality. In this work, we revisit and improve on the ideas behind Landing, enabling the inclusion of modern adaptive optimizers while ensuring that orthogonality constraints are effectively met. Remarkably, these improvements come at little to no cost and reduce the number of required hyperparameters. Our algorithm, POGO, is fast and GPU-friendly, consists of only 5 matrix products, and in practice maintains orthogonality at all times. On several challenging benchmarks, POGO greatly outperforms recent optimizers and can optimize problems with thousands of orthogonal matrices in minutes, where alternatives would take hours. As such, POGO sets a milestone to finally exploit orthogonality constraints in ML at scale. A PyTorch implementation of POGO is publicly available at https://github.com/adrianjav/pogo.

Paper Structure (49 sections, 14 theorems, 58 equations, 11 figures, 1 algorithm)

Introduction
Contributions.
Optimization on the Stiefel Manifold
The Landing Algorithm
Proximal One-step Geometric Optimization
Adopting Modern Optimizers
Leap, Land, Repeat
Intermediate step.
Landing polynomial.
Choosing a step size.
A Surprising Approximation
Intuition.
Convergence.
Summary and Practical Considerations
Computational cost.
...and 34 more sections

Key Result

Lemma 3.1

Let $P(\lambda)$ be as defined above. Then $P(\lambda)$ is a quartic polynomial in $\lambda$. Specifically, its coefficients are expressed in terms of $\mC \coloneqq \mM_t\mM_t^\top - \mI$, $\mE \coloneqq \nabla\mathcal{N}(\mM_t)\nabla\mathcal{N}(\mM_t)^\top$ and $\mD \coloneqq \mM_t\nabla\mathcal{N}(\mM_t)^\top + \nabla\mathcal{N}$…
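A quick numerical sanity check of the lemma's claim can be sketched as follows. The definition of $P(\lambda)$ is not reproduced in this excerpt, so the sketch assumes the usual Landing setup: $\mathcal{N}(\mM) = \tfrac{1}{4}\|\mM\mM^\top - \mI\|_F^2$, hence $\nabla\mathcal{N}(\mM) = (\mM\mM^\top - \mI)\mM$, and $P(\lambda) = \|(\mM - \lambda\nabla\mathcal{N}(\mM))(\mM - \lambda\nabla\mathcal{N}(\mM))^\top - \mI\|_F^2$. The definition of $\mD$ is truncated above; the sketch uses the symmetric completion that the expansion yields, which is also an assumption. All helper names are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import random
from math import comb

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def combine(a, b, alpha=1.0, beta=1.0):
    """Return alpha * a + beta * b, elementwise."""
    return [[alpha * x + beta * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def eye(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def fro2(a):
    """Squared Frobenius norm."""
    return sum(x * x for row in a for x in row)

def landing_terms(m):
    c = combine(matmul(m, transpose(m)), eye(len(m)), 1.0, -1.0)   # C = M M^T - I
    g = matmul(c, m)                    # ASSUMED gradient of N: (M M^T - I) M
    d = combine(matmul(m, transpose(g)), matmul(g, transpose(m)))  # ASSUMED D = M G^T + G M^T
    e = matmul(g, transpose(g))                                    # E = G G^T
    return c, d, e, g

def p_direct(m, g, lam):
    """ASSUMED P(lam): squared distance of (M - lam G) from orthogonality."""
    x = combine(m, g, 1.0, -lam)
    return fro2(combine(matmul(x, transpose(x)), eye(len(m)), 1.0, -1.0))

def p_expanded(c, d, e, lam):
    """Same quantity via the expansion C - lam D + lam^2 E."""
    return fro2(combine(combine(c, d, 1.0, -lam), e, 1.0, lam * lam))

random.seed(0)
M = [[random.gauss(0.0, 0.5) for _ in range(4)] for _ in range(3)]
C, D, E, G = landing_terms(M)

# Both forms of P should agree for any lambda.
max_gap = max(abs(p_direct(M, G, lam) - p_expanded(C, D, E, lam))
              for lam in (0.0, 0.1, 0.5, 1.0, -0.3))

# A quartic has a vanishing fifth finite difference, and its fourth
# finite difference equals 24 * h^4 * (leading coefficient).
h, lam0 = 0.25, -0.5
samples = [p_direct(M, G, lam0 + k * h) for k in range(6)]
fifth_diff = sum((-1) ** (5 - k) * comb(5, k) * samples[k] for k in range(6))
fourth_diff = sum((-1) ** (4 - k) * comb(4, k) * samples[k] for k in range(5))
lead_coeff = fourth_diff / (24 * h ** 4)

print(max_gap, fifth_diff, lead_coeff, fro2(E))
```

Under these assumptions, the fifth finite difference vanishes up to rounding and the recovered leading coefficient matches $\|\mE\|_F^2$, which is what a quartic in $\lambda$ with that expansion predicts.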

Figures (11)

Figure 1: POGO optimizes thousands of orthogonal matrices orders of magnitude faster than retraction methods while achieving performance comparable to unconstrained optimizers, as shown in this CIFAR-10 (Krizhevsky, 2009) classification problem with a tailored CNN (Jordan, 2024) parameterized with orthogonal filters or kernels. While RSDM (Han et al., 2025) takes 17 hours to train on average, POGO trains in 3 minutes.
Figure 2: Illustration of the Landing algorithm, adapted from Ablin et al. (2024). Landing combines two orthogonal gradients at each iteration and adapts the learning rate $\eta$ to keep $\mX_1$ within $\varepsilon$-distance from the Stiefel manifold.
Figure 3: Illustration of the POGO algorithm (see the methodology section). By computing the distance with respect to the intermediate point $\mM$, POGO can calculate the exact $\lambda$ needed to stay on the manifold.
Figure 4: POGO reduces the optimality gap the fastest across all baselines while staying on the manifold. Results are averaged over 10 independent runs and the orthogonal matrices are of size $1500\times 2000$ for PCA and $2000\times 2000$ for Procrustes.
Figure 5: While all methods obtain similar test accuracies with an O-ViT (Fei et al., 2022) on CIFAR-10 (Krizhevsky, 2009), POGO is the fastest method to complete 10 epochs without leaving the manifold. Results show average and 95% confidence intervals over 5 independent runs.
...and 6 more figures

Theorems & Definitions (24)

Definition 1
Lemma 3.1
Proposition 3.2
Proposition 3.3
Theorem 3.4
Theorem 3.5
Lemma 1.1
proof
Proposition 1.2
proof
...and 14 more
