Orthogonal Finetuning Made Scalable

Zeju Qiu, Weiyang Liu, Adrian Weller, Bernhard Schölkopf

TL;DR

This work addresses the scalability bottlenecks of orthogonal finetuning (OFT) by introducing OFTv2, a matrix-free, input-centric reformulation that replaces matrix-matrix multiplications on the weights with matrix-vector operations on the input, reducing forward-pass complexity from $O(nd^2)$ to $O(nd+d^2)$. It also makes the orthogonal parameterization cheaper through the Cayley-Neumann approximation, which enables inverse-free, stable training on very large foundation models, and it extends the approach to quantized models via QOFT. Across diverse models (BART, Llama-2, Qwen2.5, Stable Diffusion 3.5), OFTv2 achieves up to 10x faster training and 3x lower GPU memory usage with performance on par with or better than LoRA/QLoRA, and QOFT shows stronger stability and memory efficiency in quantized settings. The result is a practical, scalable, and robust parameter-efficient finetuning method suitable for ultra-large models and multi-modal tasks.
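The difference between the two formulations can be illustrated with a minimal NumPy sketch. It assumes a pretrained weight `W0` of shape d x n, a dense orthogonal matrix `R` of shape d x d, and a forward pass of the form `(R @ W0).T @ x`; these shapes, the function names, and the use of a dense `R` (rather than any structured form the paper may use) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def forward_weight_centric(R, W0, x):
    # Merge the orthogonal matrix into the weight first.
    # R @ W0 is a (d x d) by (d x n) matrix-matrix product: O(n d^2).
    W = R @ W0
    return W.T @ x

def forward_input_centric(R, W0, x):
    # Apply the orthogonal matrix to the input instead.
    # R.T @ x costs O(d^2); W0.T @ (...) costs O(n d). No merged weight is formed.
    return W0.T @ (R.T @ x)

rng = np.random.default_rng(0)
d, n = 8, 5
W0 = rng.standard_normal((d, n))
# Any orthogonal R serves for the illustration; here it comes from a QR factorization.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
x = rng.standard_normal(d)

y_weight = forward_weight_centric(R, W0, x)
y_input = forward_input_centric(R, W0, x)
```

Both functions compute the same output because `(R @ W0).T @ x` equals `W0.T @ (R.T @ x)`; only the order of operations, and therefore the cost, differs.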

Abstract

Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. To overcome this, we propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. We further introduce the Cayley-Neumann parameterization, an efficient orthogonal parameterization that approximates the matrix inversion in the Cayley transform via a truncated Neumann series. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance. In addition, we extend OFTv2 to support finetuning quantized foundation models and show that it outperforms the popular QLoRA in training stability, efficiency, and memory usage.
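As a rough illustration of the Cayley-Neumann idea, the sketch below compares the exact Cayley transform with a truncated Neumann series. It assumes the convention R = (I + Q)(I - Q)^{-1} with Q skew-symmetric and the approximation (I - Q)^{-1} ≈ I + Q + ... + Q^k; the function names, the `num_terms` parameter, and the scale of `Q` are hypothetical choices for this example, and the series is only accurate when the norm of Q is well below 1.

```python
import numpy as np

def cayley_exact(Q):
    # Exact Cayley transform. (I + Q) and (I - Q)^{-1} commute, so solving
    # (I - Q) R = (I + Q) gives the same R as (I + Q)(I - Q)^{-1}.
    I = np.eye(Q.shape[0])
    return np.linalg.solve(I - Q, I + Q)

def cayley_neumann(Q, num_terms=8):
    # Inverse-free variant: replace (I - Q)^{-1} by I + Q + ... + Q^num_terms.
    I = np.eye(Q.shape[0])
    inv_approx = I.copy()
    power = I.copy()
    for _ in range(num_terms):
        power = power @ Q          # next power of Q
        inv_approx = inv_approx + power
    return (I + Q) @ inv_approx

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
Q = 0.02 * (A - A.T)               # skew-symmetric with small norm (assumed)

R_exact = cayley_exact(Q)
R_approx = cayley_neumann(Q, num_terms=8)
max_gap = np.max(np.abs(R_exact - R_approx))
```

The truncated version needs only matrix products, which is why it avoids the cost and numerical fragility of an explicit inverse; the trade-off is that the result is orthogonal only up to the truncation error.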

Paper Structure

This paper contains 30 sections, 4 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: OFTv2 significantly reduces training time and GPU memory usage without sacrificing performance. The finetuning is performed with Qwen2.5-7B.
  • Figure 2: Comparison between LoRA and OFT.
  • Figure 3: Comparison between sequential (e.g., OFT) and parallel (e.g., LoRA) adaptation.
  • Figure 4: Results of GPU memory usage for the same finetuning task. (a) OFT, LoRA and OFTv2 on Qwen2.5; (b) QLoRA and QOFT on NF4-quantized Qwen2.5; (c) QLoRA and QOFT on AWQ-quantized Qwen2.5.
  • Figure 5: Qualitative results from Dreambooth finetuning of Stable Diffusion 3.5 Large (8.1B parameters), with peak allocated GPU memory: LoRA (52.33 GB), OFT (52.32 GB), QLoRA (41.60 GB) and QOFT (41.53 GB).
  • ...and 2 more figures