Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models

Zichen Miao, Wei Chen, Qiang Qiu

TL;DR

Coeff-Tuning reframes multi-head attention as a graph-filter subspace and learns a compact coefficient matrix $\boldsymbol{\alpha} \in \mathbb{R}^{H\times H}$ to linearly combine head filters $\mathbf{F}^i(\mathbf{X})$, expanding the expressive capacity beyond the convex hull of value embeddings. The method employs a residual parameterization $\boldsymbol{\alpha}'=\boldsymbol{\alpha}+\mathbf{I}$ and regularizes training with dropout on $\boldsymbol{\alpha}$, enabling stable, plug-and-play integration with existing PEFT approaches like LoRA. The authors provide theoretical insights that the original filter subspace yields bounded outputs, while the learned subspace enlarges the reachable output set, supported by toy demonstrations and complexity analyses. Empirically, Coeff-Tuning yields superior performance across few-shot ViT classification, personalized text-to-image generation, and multi-task image-text understanding, with minimal parameter overhead and strong compatibility with other PEFT methods.

Abstract

Transformer-based large pre-trained models have shown remarkable generalization ability, and various parameter-efficient fine-tuning (PEFT) methods have been proposed to customize these models on downstream tasks with minimal computational and memory budgets. Previous PEFT methods are primarily designed from a tensor-decomposition perspective that tries to effectively tune the linear transformation by finding the smallest subset of parameters to train. Our study adopts an orthogonal view by representing the attention operation as a graph convolution and formulating the multi-head attention maps as a convolutional filter subspace, with each attention map as a subspace element. In this paper, we propose to tune large pre-trained transformers by learning a small set of combination coefficients that construct a more expressive filter subspace from the original multi-head attention maps. We show analytically and experimentally that the tuned filter subspace can effectively expand the feature space of the multi-head attention and further enhance the capacity of transformers. We further stabilize fine-tuning with a residual parameterization of the tunable subspace coefficients, and improve generalization with a regularization design that applies dropout directly to the tunable coefficients during training. The tunable coefficients require only a tiny number of parameters and can be combined with previous PEFT methods in a plug-and-play manner. Extensive experiments show that our approach outperforms PEFT baselines with negligible additional parameters.
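To make the mechanism concrete, the following is a minimal sketch of the filter combination described above, written in plain Python. It assumes the unscaled filter $\mathbf{F}^h(\mathbf{X})=\text{softmax}(\mathbf{X}\mathbf{W}_q^h\mathbf{W}_k^{h\intercal}\mathbf{X}^{\intercal})$ and the residual form $\boldsymbol{\alpha}'=\boldsymbol{\alpha}+\mathbf{I}$. The function names, the inverted-dropout scaling, and the choice to apply dropout before adding the identity are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def head_filters(x, w_q, w_k):
    """One N x N filter per head: F^h(X) = softmax(X W_q^h (W_k^h)^T X^T)."""
    filters = []
    for wq, wk in zip(w_q, w_k):
        q, k = matmul(x, wq), matmul(x, wk)
        scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T
        filters.append([softmax(r) for r in scores])
    return filters


def dropout(alpha, p, rng):
    """Assumed inverted dropout on the entries of alpha (training only, 0 <= p < 1)."""
    keep = 1.0 - p
    return [[a / keep if rng.random() < keep else 0.0 for a in row] for row in alpha]


def combine_filters(filters, alpha, p=0.0, rng=None):
    """Tuned filter for head h: sum_i alpha'[h][i] * F^i(X), with alpha' = alpha + I."""
    num_heads, n = len(filters), len(filters[0])
    if p > 0.0:
        alpha = dropout(alpha, p, rng or random.Random(0))
    # Residual parameterization: alpha = 0 leaves the pre-trained maps untouched.
    alpha_res = [[alpha[h][i] + (1.0 if h == i else 0.0) for i in range(num_heads)]
                 for h in range(num_heads)]
    return [[[sum(alpha_res[h][i] * filters[i][r][c] for i in range(num_heads))
              for c in range(n)] for r in range(n)] for h in range(num_heads)]


# Hypothetical toy sizes: 4 tokens, width 6, 2 heads of width 3.
rng = random.Random(0)
N, D, H, DH = 4, 6, 2, 3
x = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
w_q = [[[rng.gauss(0, 1) for _ in range(DH)] for _ in range(D)] for _ in range(H)]
w_k = [[[rng.gauss(0, 1) for _ in range(DH)] for _ in range(D)] for _ in range(H)]

filters = head_filters(x, w_q, w_k)
unchanged = combine_filters(filters, [[0.0] * H for _ in range(H)])
tuned = combine_filters(filters, [[0.5, -1.5], [0.0, 0.2]])
```

With $\boldsymbol{\alpha}=\mathbf{0}$ the combined filters equal the original attention maps, which is why the residual form gives a stable starting point for fine-tuning. With a nonzero $\boldsymbol{\alpha}$ the rows of a tuned filter no longer have to be non-negative or sum to one, and the only new parameters per layer are the $H \times H$ entries of $\boldsymbol{\alpha}$.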

Paper Structure

This paper contains 39 sections, 4 theorems, 15 equations, 10 figures, 5 tables.

Key Result

Proposition 4.2

If $\mathcal{S}^{h}$ is a bounded convex set, $\forall h=1,\cdots,H$, the set $\mathcal{S} = \mathcal{S}^{1}+\cdots+\mathcal{S}^{H}=\{\mathbf{x}^{1} + \cdots + \mathbf{x}^{H} \mid \forall \mathbf{x}^{h} \in \mathcal{S}^{h}\}$ is also a bounded convex set.
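The proposition follows from a standard argument about Minkowski sums. The sketch below is not copied from the paper's appendix. The bounds $M_h$ and the points $\mathbf{x}, \mathbf{y}$ are notation introduced here for illustration.

```latex
% Convexity: take x = \sum_h x^h and y = \sum_h y^h in S, and \lambda \in [0, 1].
\begin{align*}
\lambda \mathbf{x} + (1-\lambda)\mathbf{y}
  &= \sum_{h=1}^{H} \bigl(\lambda \mathbf{x}^{h} + (1-\lambda)\mathbf{y}^{h}\bigr),
  \qquad \lambda \mathbf{x}^{h} + (1-\lambda)\mathbf{y}^{h} \in \mathcal{S}^{h}
  \ \text{by convexity of } \mathcal{S}^{h},
\end{align*}
% so the combination is again a sum of one element from each S^h, i.e. it lies in S.

% Boundedness: suppose \|x^h\| \le M_h for every x^h in S^h.
\begin{align*}
\|\mathbf{x}\| = \Bigl\|\sum_{h=1}^{H} \mathbf{x}^{h}\Bigr\|
  \le \sum_{h=1}^{H} \|\mathbf{x}^{h}\|
  \le \sum_{h=1}^{H} M_h .
\end{align*}
```

Read in the context of attention, this says that if each head's reachable outputs form a bounded convex set, summing the heads cannot escape a bounded convex set either.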

Figures (10)

  • Figure 1: (Top) Under the graph-convolution view of attention, the proposed Coeff-Tuning enhances expressiveness by combining the attention maps through a tiny tunable subspace coefficient $\boldsymbol{\alpha}$. It can be seamlessly integrated with other PEFT methods, e.g., LoRA (Hu et al., 2021), while introducing negligible additional parameters. (Bottom) Combining Coeff-Tuning with LoRA achieves better results on the concept customization task with SDXL (Podell et al., 2023).
  • Figure 2: Illustration of the proposed Coeff-Tuning. We first view each token $\mathbf{X}[i]$ as a node $V_i$ in a fully connected graph $\mathcal{G}=(V,E)$. The attention operation then becomes a graph convolution, with each attention map $\mathbf{A}^h$ acting as a graph convolutional filter $\mathbf{F}^h(\mathbf{X})$. In this way, the attention maps in the multi-head attention layer form a filter subspace $\mathcal{F}(\mathbf{X})$, and each element of a filter is constrained to $(0, 1)$ by the softmax. The proposed Coeff-Tuning then tunes a subspace coefficient $\boldsymbol{\alpha}\in \mathbb{R}^{H\times H}$ with a tiny number of parameters, which expands the range of attention scores by linearly combining the elements of the original filter subspace and thereby enhances the expressiveness of the multi-head attention.
  • Figure 3: The effectiveness of tuning the subspace coefficient $\boldsymbol{\alpha}$. The goal is to convert the input graph $\mathbf{X}$ to the target graph $\mathbf{O}$ using a single attention layer, $\mathbf{O} = \sum_{h=1}^{H} \mathbf{F}^h(\mathbf{X}) \mathbf{X} \mathbf{W}_v^h$, where $\mathbf{F}(\mathbf{X})=\text{softmax}(\mathbf{X} \mathbf{W}_q \mathbf{W}_k^{\intercal} \mathbf{X}^{\intercal})$. We have the following observations: (i) When only tuning $\mathbf{W}_q$ and $\mathbf{W}_k$, i.e., updating $\mathbf{F}(\mathbf{X})$, the positions of the output nodes are constrained within the range of the input nodes $\mathbf{X}$, since the values of $\mathbf{F}(\mathbf{X})$ lie in $(0,1)$. (ii) In addition to tuning $\mathbf{F}(\mathbf{X})$, updating $\mathbf{W}_v$ projects the input as $\mathbf{X} \mathbf{W}_v$, but the output graph is still not aligned with the target graph. (iii) In addition to tuning $\mathbf{F}(\mathbf{X})$, adapting $\boldsymbol{\alpha}$ successfully transforms the input graph into the target graph. Compared with (i), $\boldsymbol{\alpha}$ expands the positions of the output nodes $\sum_{i=1}^{H}\boldsymbol{\alpha}[:,i] \mathbf{F}^i(\mathbf{X}) \mathbf{X}$ beyond the constraints of the input nodes, effectively enhancing the expressiveness of the attention.
  • Figure 4: Results on Concept Customization. Compared to the baseline method (middle row), Coeff-Tuning (bottom row) leads to higher concept fidelity and text alignment of the generated images. For instance, our method accurately generates images of a "$\langle V \rangle$ plushie sloth in the snow," while the baseline fails to align with the text prompt. Additionally, our method generates a "$\langle V \rangle$ wooden pot" with higher fidelity to the input image.
  • Figure 5: Results on Concept Customization.
  • ...and 5 more figures
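
The observation in Figure 3 (i) versus (iii) can be reproduced on a hand-built example. The sketch below is a hypothetical toy setup, not the paper's experiment: three 2-D nodes, two fixed row-stochastic filters `f1` and `f2` standing in for softmax outputs, $\mathbf{W}_v$ taken as the identity, and an assumed coefficient row `alpha_row = [2.0, -1.0]` playing the role of one row of $\boldsymbol{\alpha}'$.

```python
def apply_filter(f, x):
    """Graph convolution F X: each output node is a weighted sum of the input nodes."""
    return [[sum(w * node[d] for w, node in zip(row, x)) for d in range(len(x[0]))]
            for row in f]


def inside_bounding_box(points, nodes, tol=1e-12):
    """True if every point lies within the per-coordinate range of the input nodes."""
    dims = range(len(nodes[0]))
    lo = [min(n[d] for n in nodes) for d in dims]
    hi = [max(n[d] for n in nodes) for d in dims]
    return all(lo[d] - tol <= p[d] <= hi[d] + tol for p in points for d in dims)


# Input graph: three nodes at the corners of a triangle.
x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Two filters with entries in (0, 1) and rows summing to 1, as softmax would produce.
f1 = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
f2 = [[0.2, 0.4, 0.4], [0.4, 0.2, 0.4], [0.4, 0.4, 0.2]]

# Linear combination of the two filters with coefficients outside [0, 1].
alpha_row = [2.0, -1.0]
mixed = [[alpha_row[0] * a + alpha_row[1] * b for a, b in zip(r1, r2)]
         for r1, r2 in zip(f1, f2)]

out_plain = apply_filter(f1, x)     # convex combinations of the input nodes
out_mixed = apply_filter(mixed, x)  # combined filter has negative weights

print(inside_bounding_box(out_plain, x), inside_bounding_box(out_mixed, x))
```

Each row of `f1` is a convex combination, so `out_plain` stays inside the range of the input nodes. The combined filter contains negative weights, so `out_mixed` can land outside that range, which is the extra reach that tuning $\boldsymbol{\alpha}$ provides.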

Theorems & Definitions (7)

  • Definition 4.1
  • Proposition 4.2
  • Proposition 4.3
  • Proposition A.1
  • proof
  • Proposition A.2
  • proof