Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization

Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, Bernhard Schölkopf

TL;DR

The paper addresses the cost of adapting very large foundation models by revisiting Orthogonal Finetuning (OFT) and introducing Orthogonal Butterfly (BOFT), which composes a dense orthogonal matrix from a product of sparse butterfly factors. Motivated by an information-transmission view of how dense matrices can be generated, BOFT reduces the number of trainable parameters for a $d \times d$ orthogonal transform to $O(d \log d)$ while preserving orthogonality and offering a tunable trade-off between expressivity and regularity. The approach is evaluated on large language models, vision foundation models, and text-to-image diffusion models, where the authors report that BOFT outperforms LoRA and OFT under similar parameter budgets and, in many cases, approaches or matches full finetuning. The results suggest broad applicability, practical parameter savings, and a structured inductive bias that helps generalization on downstream tasks.
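
To make the $O(d \log d)$ figure concrete, the following back-of-the-envelope count may help. It is a sketch under stated assumptions rather than the paper's own accounting: each sparse factor is assumed to be a (permuted) block-diagonal orthogonal matrix with $d/b$ freely parameterized blocks of size $b$, and the symbols $m$ (number of factors) and $b$ (block size) are used here only for illustration.

$$
\underbrace{\frac{d(d-1)}{2}}_{\text{one dense orthogonal matrix}}
\qquad \text{vs.} \qquad
\underbrace{m \cdot \frac{d}{b} \cdot \frac{b(b-1)}{2} \;=\; \frac{m\,d\,(b-1)}{2}}_{\text{product of } m \text{ block-sparse orthogonal factors}}
$$

With $b = 2$ and $m = \log_2 d$ this gives $\tfrac{d}{2}\log_2 d$ parameters. For $d = 1024$, for instance, that is $5{,}120$ parameters instead of $523{,}776$ for an unconstrained dense orthogonal matrix.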

Abstract

Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.
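
The butterfly idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes $d$ is a power of two, uses 2x2 planar rotations parameterized by angles as the blocks, and the helper names `butterfly_factor` and `butterfly_orthogonal` are hypothetical. It shows that each factor is sparse, yet the product of $\log_2 d$ factors is orthogonal and has no zero entries.

```python
import numpy as np

def butterfly_factor(d, stride, angles):
    """One sparse orthogonal factor: d/2 planar rotations, each pairing index i with i + stride."""
    F = np.zeros((d, d))
    k = 0
    for start in range(0, d, 2 * stride):
        for offset in range(stride):
            i = start + offset
            j = i + stride
            c, s = np.cos(angles[k]), np.sin(angles[k])
            F[i, i], F[i, j] = c, -s
            F[j, i], F[j, j] = s, c
            k += 1
    return F

def butterfly_orthogonal(d, angles):
    """Multiply log2(d) butterfly factors; angles has shape (log2(d), d // 2)."""
    levels = d.bit_length() - 1  # log2(d), assuming d is a power of two
    R = np.eye(d)
    for level in range(levels):
        R = butterfly_factor(d, 2 ** level, angles[level]) @ R
    return R

d = 8
rng = np.random.default_rng(0)
# Angles kept away from 0 and pi/2 so that every cos/sin is nonzero (illustrative choice).
angles = rng.uniform(0.1, np.pi / 2 - 0.1, size=(3, d // 2))

first_factor = butterfly_factor(d, 1, angles[0])
R = butterfly_orthogonal(d, angles)

print("nonzeros in one factor:", np.count_nonzero(first_factor))
print("nonzeros in the product:", np.count_nonzero(np.abs(R) > 1e-12))
print("orthogonal:", np.allclose(R.T @ R, np.eye(d)))
print("trainable angles:", angles.size)
```

Setting every angle to zero except those in the first factor leaves a plain block-diagonal rotation, which is one way to see how a block-diagonal scheme can sit inside the butterfly family as a special case.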

Paper Structure

This paper contains 33 sections, 5 theorems, 15 equations, 19 figures, and 9 tables.

Key Result

Theorem 1

BOFT is more expressive than OFT with the same block size. For the butterfly matrix to approximate all orthogonal matrices of size $d$, we can multiply butterfly matrices with $\bm{B}_{d-1,1}(d)\bm{B}^\top_{d-1,2}(d)\cdots\bm{B}_{1,1}(d)\bm{B}^\top_{1,2}(d)$, where $\bm{B}_{i,j}(d),\ \forall i,\ \forall \ldots$

Figures (19)

  • Figure 1: A comparison of reparameterization between LoRA and OFT.
  • Figure 2: An illustration of the information transmission view on generating dense matrices. This example uses $d=4$ and $m=5$.
  • Figure 3: An example of block-diagonal structure in OFT.
  • Figure 4: The butterfly structure ($d=8$).
  • Figure 5: Expressiveness of BOFT.
  • ...and 14 more figures

Theorems & Definitions (10)

  • Theorem 1: Expressivity of BOFT
  • Definition 1: Generalized Rotation Matrix
  • Definition 2: Orthogonal Butterfly Matrix
  • Definition 3: Diagonal and Scalar Butterfly Matrix
  • Remark 1
  • Proposition 1: peca2021numerical
  • Proposition 2: peca2021numerical
  • Proposition 3: Upper Bound for Simple Scalar Butterfly Matrices
  • Proposition 4: Upper Bound for General Scalar Butterfly Matrices
  • Definition 4: Random Orthogonal Butterfly Matrix