PoLAR: Polar-Decomposed Low-Rank Adapter Representation

Kai Lion, Liang Zhang, Bingcong Li, Niao He

TL;DR

PoLAR addresses the underutilization of the allocated subspace in low-rank adapters used for fine-tuning large language models. It introduces a representation inspired by the polar decomposition that enforces orthogonality of the direction factors by constraining them to Stiefel manifolds. Updates are parameterized as $\Delta W = X \Theta Y^\top$, with $X, Y$ on the Stiefel manifold and $\Theta$ unconstrained, and are optimized with a landing-field approach. With this design, PoLAR achieves an exponentially faster convergence rate on a canonical low-rank adaptation problem and yields consistent gains for models from 350M to 27B parameters on commonsense reasoning, mathematical problem solving, and natural language understanding. Empirically, PoLAR increases the stable rank of the updates, mitigates the directional diversity collapse observed in LoRA, and improves accuracy over LoRA and DoRA across benchmarks, while offering practical runtime benefits on GPUs. Co-designing the architecture and the optimizer, and using infeasible manifold optimization (iterates need not lie exactly on the manifold at every step), enables scalable, parameter-efficient fine-tuning. Future work includes deeper spectral analysis of PoLAR dynamics and broader application beyond NLP.
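
The landing-field idea can be illustrated with a minimal NumPy sketch. It assumes the field known from the landing-method literature, $\Lambda(X) = \mathrm{skew}(G X^\top) X + \lambda X (X^\top X - I)$, where $G$ is the Euclidean gradient. The exact field and step sizes used by PoLAR are not given in this summary, and the names `landing_field`, `landing_step`, `eta`, and `lam` are illustrative assumptions.

```python
import numpy as np

def skew(a):
    """Skew-symmetric part of a square matrix."""
    return 0.5 * (a - a.T)

def stiefel_distance(x):
    """Frobenius distance of X^T X from the identity (zero on the manifold)."""
    return np.linalg.norm(x.T @ x - np.eye(x.shape[1]))

def landing_field(x, grad, lam=1.0):
    """Assumed landing field: skew(G X^T) X + lam * X (X^T X - I)."""
    relative = skew(grad @ x.T) @ x              # loss-driven component
    normal = x @ (x.T @ x - np.eye(x.shape[1]))  # gradient of 0.25 * ||X^T X - I||_F^2
    return relative + lam * normal

def landing_step(x, grad, eta=0.1, lam=1.0):
    """One plain (retraction-free) step along the negative landing field."""
    return x - eta * landing_field(x, grad, lam)

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((8, 3)))  # q has orthonormal columns

# Start slightly off the manifold; with a zero gradient only the attraction acts.
x0 = 1.1 * q
x1 = landing_step(x0, np.zeros_like(x0))
dist_before = stiefel_distance(x0)
dist_after = stiefel_distance(x1)

# On the manifold, the field is tangent: q^T D + D^T q vanishes up to rounding.
grad = rng.standard_normal((8, 3))
direction = landing_field(q, grad)
tangent_residual = np.linalg.norm(q.T @ direction + direction.T @ q)
```

In this toy setting `dist_after` is smaller than `dist_before`, so the iterate is pulled back toward the manifold without a QR- or SVD-based retraction at each step.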

Abstract

We show that low-rank adaptation of large-scale models suffers from a low stable rank that is well below the linear algebraic rank of the subspace, degrading fine-tuning performance. To mitigate the underutilization of the allocated subspace, we propose PoLAR, a parameterization inspired by the polar decomposition that factorizes the low-rank update into two direction matrices constrained to Stiefel manifolds and an unconstrained scale matrix. Our theory shows that PoLAR yields an exponentially faster convergence rate on a canonical low-rank adaptation problem. Pairing the parameterization with Riemannian optimization leads to consistent gains on three different benchmarks testing general language understanding, commonsense reasoning, and mathematical problem solving with base model sizes ranging from 350M to 27B.
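
To make the gap between stable rank and algebraic rank concrete, the sketch below uses the standard definition $\mathsf{sr}(A) = \|A\|_F^2 / \|A\|_2^2$ (assumed here, since this summary does not spell it out) and builds updates of the form $X \Theta Y^\top$ whose direction factors have orthonormal columns. The dimensions, the random seed, and the helper names are illustrative assumptions.

```python
import numpy as np

def stable_rank(a):
    """Squared Frobenius norm divided by squared spectral norm."""
    s = np.linalg.svd(a, compute_uv=False)  # singular values, descending
    return float(np.sum(s**2) / s[0]**2)

def random_stiefel(n, r, rng):
    """Random n x r matrix with orthonormal columns via reduced QR."""
    q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return q

rng = np.random.default_rng(0)
m, n, r = 64, 48, 8
x = random_stiefel(m, r, rng)        # direction factor, x.T @ x = I
y = random_stiefel(n, r, rng)        # direction factor, y.T @ y = I
theta = rng.standard_normal((r, r))  # unconstrained scale factor

delta_w = x @ theta @ y.T
sr_update = stable_rank(delta_w)
sr_theta = stable_rank(theta)
alg_rank = np.linalg.matrix_rank(delta_w, tol=1e-10)

# A scale matrix dominated by one singular value keeps algebraic rank r
# but has a stable rank close to 1.
theta_collapsed = np.diag([10.0] + [0.1] * (r - 1))
delta_w_collapsed = x @ theta_collapsed @ y.T
sr_collapsed = stable_rank(delta_w_collapsed)
rank_collapsed = np.linalg.matrix_rank(delta_w_collapsed, tol=1e-10)
```

Because $X$ and $Y$ have orthonormal columns, the singular values of $X \Theta Y^\top$ equal those of $\Theta$, so the stable rank of the update is determined entirely by the scale matrix in this parameterization.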

Paper Structure

This paper contains 55 sections, 25 theorems, 94 equations, 11 figures, 10 tables, 3 algorithms.

Key Result

Lemma 1

Let $\beta_t:= \sigma_1(\mathbf{I}_{r_A} - \mathbf{\Phi}_t\mathbf{\Phi}_t^\top)$ and $\delta_t:= \sigma_1(\mathbf{I}_{r_A} - \mathbf{\Psi}_t\mathbf{\Psi}_t^\top)$, and suppose that the learning rates are chosen as $\eta < 1$ and $\gamma=1$. If the following conditions are met, Alg. alg.rgd guarantees that $\mathsf{Tr}(\mathbf{\Phi}_{t+1}\mathbf{\Phi}_{t+1}^\top) \geq \mathsf{Tr}(\mathbf{\Phi}_t\m

Figures (11)

  • Figure 1: (a) Illustration of directional diversity collapse (DC) of $\tilde{\mathbf{w}}_i = \mathbf{w}_i / \| \mathbf{w}_i \|_2$, where $\mathbf{w}_i$ denotes the $i$-th row of the low-rank update $\Delta \mathbf{W}$. (b) and (c) Diversity of update directions of LoRA and PoLAR, respectively, for a Llama-2-7B down-projection layer on the Social-IQA dataset. Each pixel shows $\| \tilde{\mathbf{w}}_i - \tilde{\mathbf{w}}_j \|_2$; rows and columns are rearranged to reveal cluster patterns in both plots. The emergence of a cluster pattern is evidence of DC. The algebraic rank is 32 for both methods, yet the stable rank is 1.06 for LoRA and 5.49 for PoLAR. See also Section \ref{sec:overcoming-low-stable-rank}.
  • Figure 2: $\mathsf{sr}(\Delta \mathbf{W})$ of Llama-2-7B low-rank updates fine-tuned on commonsense tasks with rank 32.
  • Figure 3: Overparameterized matrix factorization for different condition numbers and degrees of overparameterization using the PoLAR parameterization with Riemannian Gradient Descent (PO + RGD) and the Burer-Monteiro parameterization with Gradient Descent (BM + GD). (a) Asymmetric scenario and (b) symmetric case. The experimental details can be found in Appendix \ref{appdx:mf-experimental-details}.
  • Figure 4: Left two: Stable rank distribution of Llama-2-7B adapter layers trained with PoLAR and LoRA and different $r$ on the Social-IQA and HellaSwag datasets. See also Fig. \ref{fig:appdx-stable-rank}. Right two: Stable rank dynamics of the fifth transformer block over the course of training by layer type. See also Fig. \ref{fig:stable-rank-dynamics}.
  • Figure 5: Stable rank of $\Delta \mathbf{W}$ for Llama-2-7B fine-tuned on commonsense reasoning tasks.
  • ...and 6 more figures
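
The pairwise quantity shown in Figure 1, $\| \tilde{\mathbf{w}}_i - \tilde{\mathbf{w}}_j \|_2$, can be computed as in the following minimal sketch. It runs on synthetic matrices rather than actual adapter weights, omits the row and column reordering used in the figure, and the helper name `direction_distances` is an assumption.

```python
import numpy as np

def direction_distances(delta_w, eps=1e-12):
    """Pairwise Euclidean distances between the unit-normalized rows of delta_w."""
    norms = np.linalg.norm(delta_w, axis=1, keepdims=True)
    w_tilde = delta_w / np.maximum(norms, eps)        # guard against zero rows
    diff = w_tilde[:, None, :] - w_tilde[None, :, :]  # shape (rows, rows, cols)
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)

# Collapsed toy update: every row is a positive multiple of one shared vector.
shared = rng.standard_normal(16)
scales = rng.uniform(0.5, 2.0, size=(10, 1))
collapsed = scales * shared

# Diverse toy update: rows point in independent random directions.
diverse = rng.standard_normal((10, 16))

d_collapsed = direction_distances(collapsed)
d_diverse = direction_distances(diverse)
off_diagonal_mean = d_diverse[~np.eye(10, dtype=bool)].mean()
```

Every entry lies in $[0, 2]$. In the collapsed case all distances are numerically zero, which corresponds to a single cluster in a plot of this kind, while independent random rows give distances spread around $\sqrt{2}$.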

Theorems & Definitions (46)

  • Definition 1: Polar Decomposition
  • Lemma 1: Increasing Alignment
  • Theorem 1: Global Convergence
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • ...and 36 more
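
For reference, the polar decomposition named in Definition 1 can be computed from an SVD. The sketch assumes the standard form $A = U P$, with $U$ having orthonormal columns and $P$ symmetric positive semidefinite. The paper's exact statement is not reproduced in this summary, and the function name is illustrative.

```python
import numpy as np

def polar_decomposition(a):
    """Return (u, p) with a = u @ p, u.T @ u = I, and p symmetric PSD."""
    w, s, vt = np.linalg.svd(a, full_matrices=False)
    u = w @ vt                  # direction part (orthonormal columns)
    p = vt.T @ np.diag(s) @ vt  # scale part (symmetric positive semidefinite)
    return u, p

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 4))
u, p = polar_decomposition(a)

recon_error = np.linalg.norm(u @ p - a)
orth_error = np.linalg.norm(u.T @ u - np.eye(4))
min_eig = np.linalg.eigvalsh(p).min()
```

The split into a direction part `u` and a scale part `p` mirrors, loosely, how PoLAR separates Stiefel-constrained direction matrices from an unconstrained scale matrix.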