Group and Shuffle: Efficient Structured Orthogonal Parametrization

Mikhail Gorbunov, Nikolay Yudin, Vera Soboleva, Aibek Alanov, Alexey Naumov, Maxim Rakhuba

TL;DR

This work introduces GS-matrices, a Group-and-Shuffle class of structured matrices, to build dense, parameter-efficient orthogonal parametrizations for fine-tuning pretrained models. The authors prove that orthogonality can be achieved with orthogonal blocks and suitably chosen permutations, build the GSOFT and Double GSOFT frameworks on this result, and extend the approach to GS Orthogonal Convolutions. The method is evaluated on GLUE, subject-driven diffusion generation, and 1-Lipschitz convolutional networks, where it shows competitive performance while using fewer trainable parameters and FLOPs than prior BOFT approaches. Overall, GS-matrices offer a scalable route to efficient orthogonal fine-tuning of large pretrained models.

Abstract

The increasing size of neural networks has led to a growing demand for efficient fine-tuning methods. Recently, an orthogonal fine-tuning paradigm was introduced that uses orthogonal matrices to adapt the weights of a pretrained model. In this paper, we introduce a new class of structured matrices, which unifies and generalizes structured classes from previous works. We examine properties of this class and build a structured orthogonal parametrization upon it. We then use this parametrization to modify the orthogonal fine-tuning framework, improving parameter and computational efficiency. We empirically validate our method on different domains, including adaptation of text-to-image diffusion models and downstream task fine-tuning in language modeling. Additionally, we adapt our construction for orthogonal convolutions and conduct experiments with 1-Lipschitz neural networks.
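
The structured orthogonal parametrization described above can be illustrated with a small numerical sketch. It assumes that a matrix of the form $\mathcal{GS}(I, P, I)$ is a product $L P R$ of two block-diagonal matrices with a permutation between them; the helper names and the particular shuffle (`transpose_permutation`) are hypothetical choices for illustration and are not taken from the paper. The sketch checks that orthogonal blocks combined with a permutation give an orthogonal product, and that the product has no zero blocks even though each factor is block-diagonal.

```python
import numpy as np

def random_orthogonal(b, rng):
    # The Q factor of a QR decomposition of a Gaussian matrix is orthogonal.
    q, _ = np.linalg.qr(rng.standard_normal((b, b)))
    return q

def block_diag(blocks):
    # Place square blocks along the diagonal of a zero matrix.
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out

def transpose_permutation(k, n):
    # Hypothetical shuffle (an assumption of this sketch): view the indices
    # 0..n-1 as a (k, n // k) grid, transpose the grid, and flatten it again.
    idx = np.arange(n).reshape(k, n // k).T.reshape(-1)
    return np.eye(n)[idx]

rng = np.random.default_rng(0)
n, k = 12, 3                      # 3 blocks of size 4 x 4 in each factor
L = block_diag([random_orthogonal(n // k, rng) for _ in range(k)])
R = block_diag([random_orthogonal(n // k, rng) for _ in range(k)])
P = transpose_permutation(k, n)

A = L @ P @ R                     # assumed form of a GS(I, P, I) matrix

# Largest deviation of A^T A from the identity.
orth_error = np.abs(A.T @ A - np.eye(n)).max()
# Fraction of entries of A that are numerically nonzero.
dense_fraction = np.count_nonzero(np.abs(A) > 1e-12) / A.size
print(orth_error, dense_fraction)
```

In this sketch the two factors store 2 · 3 · 16 = 96 block entries, compared with 144 entries for an unstructured 12 × 12 matrix; the count of free parameters under an orthogonality constraint would be smaller still and depends on how the blocks are parametrized.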

Paper Structure

This paper contains 24 sections, 3 theorems, 28 equations, 8 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

Let $A$ be a matrix from $\mathcal{GS}(I, P, I)$ with a permutation matrix $P$ defined by the function $\sigma: \{0,\dots,n-1\}\to \{0,\dots,n-1\}$. Let $\{v_i^\top\}$ be the rows of the blocks $R_1,\dots,R_{k_R}$ and $\{u_i\}$ the columns of the blocks $L_1,\dots,L_{k_L}$, in consecutive order. [...] Note that we use zero-indexing for this proposition to keep the formulas simple.
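
The setup of this proposition can be explored numerically. The sketch below is not a restatement of the truncated proposition; it only checks a fact that follows directly from the matrix product. It assumes that $A = L P R$ with block-diagonal $L$ and $R$, and that $P$ acts as $P e_i = e_{\sigma(i)}$ (both conventions are assumptions of this sketch). Under those assumptions $A = \sum_i u_{\sigma(i)} v_i^\top$, and the rank of each block of $A$ equals the number of rank-one terms that land in it. Block sizes follow Figure 1.

```python
import numpy as np

def block_diag(blocks):
    # Place blocks along the diagonal of a zero matrix.
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

rng = np.random.default_rng(0)
n, k_L, k_R, b_L, b_R = 6, 2, 3, 3, 2      # sizes as in Figure 1
L = block_diag([rng.standard_normal((b_L, b_L)) for _ in range(k_L)])
R = block_diag([rng.standard_normal((b_R, b_R)) for _ in range(k_R)])

sigma = rng.permutation(n)                 # an arbitrary zero-indexed sigma
P = np.zeros((n, n))
P[sigma, np.arange(n)] = 1.0               # assumed convention: P e_i = e_{sigma(i)}

A = L @ P @ R

# Rebuild A from rank-one terms: column sigma(i) of L times row i of R.
A_sum = sum(np.outer(L[:, sigma[i]], R[i, :]) for i in range(n))
decomposition_error = np.abs(A - A_sum).max()

# Count how many rank-one terms fall into each (row block, column block).
counts = np.zeros((k_L, k_R), dtype=int)
for i in range(n):
    counts[sigma[i] // b_L, i // b_R] += 1

# Numerical rank of each b_L x b_R block of A.
ranks = np.array([
    [np.linalg.matrix_rank(A[p * b_L:(p + 1) * b_L, q * b_R:(q + 1) * b_R])
     for q in range(k_R)]
    for p in range(k_L)
])
print(counts)
print(ranks)
```

With generic (here Gaussian) blocks the two printed arrays coincide, which is one way to see the block low-rank structure that Figure 2 illustrates.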

Figures (8)

  • Figure 1: $\mathcal{GS}(I, P, I)$ matrices with $b_L^1 = b_L^2 = 3$, $b_R^1 = b_R^2 = 2$, $k_L = 2$, $k_R = 3$. Edges between nodes denote nonzero weights.
  • Figure 2: Illustration of Proposition 1, which provides a block low-rank interpretation of $\mathcal{GS}(I,P,I)$ matrices. The matrix $R$ contains 2 blocks and the matrix $L$ contains 4 blocks.
  • Figure 3: Illustration of $P_{(k, 12)}$ permutations for $k \in \{3, 4, 6, 2\}$.
  • Figure 4: Subject-driven generation visual results on 3000 training iterations.
  • Figure 5: Demonstration of information transition through a block structure. Each node is connected to exactly $b$ consecutive nodes from the next level.
  • ...and 3 more figures

Theorems & Definitions (15)

  • Definition 3.1
  • Proposition 1
  • Theorem 1
  • proof
  • Definition 5.1
  • Remark 1
  • Remark 2
  • Definition 5.2: pmlr-v162-dao22a
  • Theorem 2
  • proof
  • ...and 5 more