Calibrating and Rotating: A Unified Framework for Weight Conditioning in PEFT

Da Chang, Peng Xue, Yu Li, Yongxiang Liu, Pengxiang Xu, Shixun Zhang

TL;DR

This paper investigates parameter-efficient fine-tuning (PEFT) for large pre-trained models and reframes DoRA as a weight-conditioning operation whose benefit comes from increasing the singular value entropy of the weight update. It shows that stable rank explains performance poorly, while the entropy of the weight-update spectrum better captures update diversity. It then builds a unified framework with two orthogonal design axes, Placement and Transformation, and introduces Pre-Diag and SORA, which achieve better accuracy and efficiency than LoRA and DoRA on NLP benchmarks. The findings advocate a principled shift from fixed low-rank updates to structured weight conditioning for scalable PEFT.

Abstract

Parameter-Efficient Fine-Tuning (PEFT) methods are crucial for adapting large pre-trained models. Among these, LoRA is considered a foundational approach. Building on this, the influential DoRA method enhances performance by decomposing weight updates into magnitude and direction. However, its underlying mechanism remains unclear, and it introduces significant computational overhead. In this work, we first identify that DoRA's success stems from its capacity to increase the singular value entropy of the weight update matrix, which promotes a more uniform update distribution akin to full fine-tuning. We then reformulate DoRA into a mathematically equivalent and more efficient matrix form, revealing it as a learnable weight conditioning method. Based on this insight, we propose a unified framework for designing advanced PEFT methods by exploring two orthogonal dimensions: the architectural placement and the transformation type of the conditioning matrix. Within this framework, we introduce two novel methods: (1) Pre-Diag, which applies a diagonal conditioning matrix before the LoRA update to efficiently calibrate the pre-trained weights, thereby enhancing performance while reducing training time; and (2) Skewed Orthogonal Rotation Adaptation (SORA), which employs a parameter-efficient orthogonal rotation to perform a more powerful, norm-preserving transformation of the feature space. Extensive experiments on natural language understanding and generation tasks demonstrate that our proposed methods achieve superior performance and efficiency compared to both LoRA and DoRA. The code is available at https://github.com/MaeChd/SORA.
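
The weight-conditioning view described in the abstract can be illustrated with a small NumPy sketch. The DoRA part follows the standard DoRA definition (column-wise normalization of $W_0 + BA$, rescaled by a magnitude vector $m$) and checks that it equals right-multiplication by a diagonal matrix. The other two functions are assumptions made for illustration only: `pre_diag_weight` guesses that Pre-Diag scales the columns of the frozen weight before the LoRA update is added, and `cayley_rotation` uses the Cayley transform of a skew-symmetric matrix as one possible way to obtain an orthogonal rotation. The abstract does not give the exact parameterizations, so the function names, shapes, and formulas here are hypothetical and may differ from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 2

W0 = rng.normal(size=(d_out, d_in))      # frozen pre-trained weight
B = 0.1 * rng.normal(size=(d_out, r))    # LoRA factor B
A = 0.1 * rng.normal(size=(r, d_in))     # LoRA factor A
m = rng.uniform(0.5, 1.5, size=d_in)     # DoRA magnitude, one entry per column


def dora_weight(W0, B, A, m):
    """Standard DoRA form: normalize each column of W0 + BA, then rescale by m."""
    V = W0 + B @ A
    return V * (m / np.linalg.norm(V, axis=0))


def dora_weight_matrix_form(W0, B, A, m):
    """Same weight written as (W0 + BA) @ D with a diagonal conditioning matrix D."""
    V = W0 + B @ A
    D = np.diag(m / np.linalg.norm(V, axis=0))
    return V @ D


def pre_diag_weight(W0, B, A, d):
    """ASSUMED Pre-Diag form: diagonal scaling of W0 only, LoRA update added after."""
    return W0 @ np.diag(d) + B @ A


def cayley_rotation(P):
    """ASSUMED rotation: Cayley transform R = (I + S)^{-1} (I - S) with S = P - P^T."""
    S = P - P.T                          # skew-symmetric by construction
    I = np.eye(S.shape[0])
    return np.linalg.solve(I + S, I - S)


def rotated_weight(W0, B, A, R):
    """ASSUMED SORA-style form: an orthogonal R acts on the input features."""
    return (W0 + B @ A) @ R


R = cayley_rotation(0.1 * rng.normal(size=(d_in, d_in)))
x = rng.normal(size=d_in)

print(np.allclose(dora_weight(W0, B, A, m), dora_weight_matrix_form(W0, B, A, m)))
print(np.allclose(R.T @ R, np.eye(d_in)))                    # R is orthogonal
print(np.isclose(np.linalg.norm(R @ x), np.linalg.norm(x)))  # norm-preserving
```

In this sketch the diagonal matrix in the DoRA form depends on $W_0 + BA$ and has to be recomputed from column norms at every step, whereas the assumed Pre-Diag form uses a free diagonal and needs no norm computation. That is one plausible reading of the efficiency claim, not a statement about the authors' implementation.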

Paper Structure

This paper contains 27 sections, 2 theorems, 22 equations, 5 figures, 6 tables.

Key Result

Theorem 1

Let $\Delta \mathbf{W}_{\text{LoRA}}$ be a LoRA update matrix of rank $r$. Its set of normalized singular values is $\Sigma_{\text{LoRA}} = \{1, \underbrace{\alpha, \dots, \alpha}_{r-1}\}$. Let $\Delta \mathbf{W}_{\text{DoRA}}$ be a DoRA update matrix of rank $s > r$. Its set of normalized singular

Figures (5)

  • Figure 1: (a) DeBERTaV3-Base on GLUE Benchmark: Stable Rank $\|\Delta \mathbf{W}\|_F^2 / \|\Delta \mathbf{W}\|_2^2$ Across Layers; (b) DeBERTaV3-Base on GLUE Benchmark: SVD Entropy $H(\sigma) = -\sum_i p_i \log p_i$ Across Layers, with Comparisons of Full Fine-Tuning, LoRA, and DoRA.
  • Figure 2: Distribution of average singular values after layer-wise normalization for LoRA and DoRA on GLUE tasks with DeBERTaV3-Base.
  • Figure 3: Taking LoRA as the baseline, DoRA and OFT can be regarded as weight-conditioning methods built on it. Our method Pre-Diag builds on DoRA by adjusting the pre-trained weights with a trainable diagonal matrix; SORA instead replaces the conditioning matrix in DoRA with a trainable, low-parameter orthogonal matrix that automatically rotates the input features.
  • Figure 4: SVD Entropy of DeBERTaV3-Base Fine-Tuned on GLUE tasks: Comparing LoRA, DoRA, Our Pre-Diag, and SORA.
  • Figure 5: Comparison of inference speed (steps per second) and training speed (steps per second) among LoRA, DoRA, and our proposed method on the PIQA, ARC-c, and OBQA tasks, using the LLaMA3-8B model.
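
The two spectral quantities named in the Figure 1 caption can be computed directly from the singular values of an update matrix. The sketch below is a minimal NumPy version. The stable rank follows the caption's formula; for the entropy, the caption does not say how $p_i$ is obtained, so normalizing the singular values to sum to one ($p_i = \sigma_i / \sum_j \sigma_j$) is an assumption, and the function names are hypothetical.

```python
import numpy as np


def stable_rank(delta_w):
    """Stable rank ||dW||_F^2 / ||dW||_2^2, computed from singular values."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # sorted in descending order
    if s[0] == 0.0:
        return 0.0                                # zero matrix: define as 0
    return float(np.sum(s ** 2) / s[0] ** 2)


def svd_entropy(delta_w, eps=1e-12):
    """SVD entropy H = -sum_i p_i log p_i, ASSUMING p_i = sigma_i / sum_j sigma_j."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    total = s.sum()
    if total == 0.0:
        return 0.0
    p = s / total
    p = p[p > eps]                                # drop numerically zero entries
    return float(-np.sum(p * np.log(p)))


# A spectrum of the shape used in Theorem 1: one dominant value and r - 1 equal
# smaller ones (alpha = 0.5 and r = 4 are chosen here only for illustration).
alpha, r = 0.5, 4
peaked = np.diag([1.0] + [alpha] * (r - 1))
flat = np.eye(r)                                  # perfectly uniform spectrum

print(stable_rank(peaked), svd_entropy(peaked))
print(stable_rank(flat), svd_entropy(flat))
```

A uniform spectrum maximizes the entropy at $\log r$, while a spectrum dominated by a single direction pushes it toward zero, so under this normalization a higher entropy indicates an update spread more evenly across directions.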

Theorems & Definitions (6)

  • Theorem 1
  • Remark 1
  • Theorem 2
  • Remark 2
  • proof
  • proof