Orthogonal Finetuning Made Scalable
Zeju Qiu, Weiyang Liu, Adrian Weller, Bernhard Schölkopf
TL;DR
This work addresses the scalability bottlenecks of orthogonal finetuning (OFT) by introducing OFTv2, a matrix-free, input-centric reformulation that replaces weight-matrix multiplications with matrix-vector operations, reducing forward-time complexity from $O(nd^2)$ to $O(nd+d^2)$. It further improves orthogonal parameterization through Cayley-Neumann approximation, enabling inverse-free, stable training on very large foundation models, and extends the approach to quantized models via QOFT. Across diverse models (BART, Llama-2, Qwen2.5, Stable Diffusion 3.5), OFTv2 achieves up to 10x faster training and 3x lower GPU memory usage with performance on par with or better than LoRA/QLoRA, and QOFT demonstrates stronger stability and memory efficiency in quantized settings. The work thus delivers practical, scalable, and robust parameter-efficient finetuning suitable for ultra-large models and multi-modal tasks.
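To make the weight-centric versus input-centric distinction concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a frozen weight $W_0 \in \mathbb{R}^{d \times n}$, an orthogonal matrix $R \in \mathbb{R}^{d \times d}$, an adapted weight $R W_0$, and a forward pass $z = (R W_0)^\top x$; these conventions and all function names are illustrative assumptions.

```python
import numpy as np

def weight_centric_forward(R, W0, x):
    # Assumed convention: the adapted weight is R @ W0 (d x d times d x n).
    # Forming it costs O(n d^2) before the input is even touched.
    W = R @ W0
    return W.T @ x

def input_centric_forward(R, W0, x):
    # Same output, but the matrices are applied to the vector one at a time:
    # R.T @ x costs O(d^2), then W0.T @ (...) costs O(n d).
    return W0.T @ (R.T @ x)

# Small demo with a random orthogonal R obtained from a QR decomposition.
rng = np.random.default_rng(0)
d, n = 8, 5
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
W0 = rng.standard_normal((d, n))
x = rng.standard_normal(d)

z_weight = weight_centric_forward(R, W0, x)
z_input = input_centric_forward(R, W0, x)
print(np.allclose(z_weight, z_input))
```

Both functions compute the same vector by associativity; only the order of operations, and therefore the cost, differs.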
Abstract
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. To overcome this, we propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. We further introduce the Cayley-Neumann parameterization, an efficient orthogonal parameterization that approximates the matrix inversion in the Cayley transform via a truncated Neumann series. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance. In addition, we extend OFTv2 to support finetuning quantized foundation models and show that it outperforms the popular QLoRA in training stability, efficiency, and memory usage.
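The Cayley-Neumann idea can be illustrated with a short NumPy sketch. It assumes the Cayley transform is written as $R = (I + Q)(I - Q)^{-1}$ with $Q$ skew-symmetric, and that the inverse is replaced by the truncated series $\sum_{i=0}^{k} Q^i$, which converges only when the spectral norm of $Q$ is below 1. The sign convention, the function names, and the default number of terms are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def cayley_exact(Q):
    # Reference Cayley transform with an explicit inverse (assumed convention).
    I = np.eye(Q.shape[0])
    return (I + Q) @ np.linalg.inv(I - Q)

def cayley_neumann(Q, num_terms=5):
    # Approximate (I - Q)^{-1} by I + Q + Q^2 + ... + Q^num_terms.
    I = np.eye(Q.shape[0])
    series = I.copy()
    power = I.copy()
    for _ in range(num_terms):
        power = power @ Q          # next power of Q
        series = series + power    # accumulate the truncated Neumann series
    return (I + Q) @ series

# Build a skew-symmetric Q with spectral norm 0.1 so the series converges quickly.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
S = A - A.T
Q = 0.1 * S / np.linalg.norm(S, 2)

R_exact = cayley_exact(Q)
R_approx = cayley_neumann(Q, num_terms=5)
approx_error = np.max(np.abs(R_exact - R_approx))
orth_error = np.max(np.abs(R_approx.T @ R_approx - np.eye(6)))
```

With a small-norm $Q$, the truncated series stays close to the exact transform and the result is nearly orthogonal; the approximation degrades as the norm of $Q$ approaches 1.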
