Large Convolutional Model Tuning via Filter Subspace

Wei Chen, Zichen Miao, Qiang Qiu

TL;DR

This work tackles the high resource cost of fine-tuning large convolutional models by introducing a filter subspace view in which convolutional weights are decomposed into a small set of spatial filter atoms and a fixed channel-mixing set of coefficients. By updating only the filter atoms and, optionally, an expanded overcomplete atom set, the method achieves strong task adaptation with far fewer trainable parameters than baselines while preserving the capabilities of pre-trained models. The approach is demonstrated on discriminative tasks with ResNet50 and ConvNeXt, and on generative tasks with Stable Diffusion, showing competitive or superior accuracy and FID with dramatically reduced parameter updates. Across both domains, the experiments reveal that maintaining fixed atom coefficients is crucial for generalization, while recursive atom decomposition offers a scalable path to increased tuning capacity without excessive parameter growth.

Abstract

Efficient fine-tuning methods are critical for addressing the high computational and parameter complexity of adapting large pre-trained models to downstream tasks. Our study is inspired by prior research that represents each convolution filter as a linear combination of a small set of filter subspace elements, referred to as filter atoms. In this paper, we propose to fine-tune pre-trained models by adjusting only the filter atoms, which are responsible for spatial-only convolution, while preserving the spatially-invariant channel combination knowledge in the atom coefficients. In this way, we bring a new filter subspace view to model tuning. Furthermore, each filter atom can be recursively decomposed as a combination of another set of atoms, which naturally expands the number of tunable parameters in the filter subspace. By adapting only the filter atoms, which are constructed from a small number of parameters, while keeping the rest of the model parameters constant, the proposed approach is highly parameter-efficient. It effectively preserves the capabilities of pre-trained models and prevents overfitting to downstream tasks. Extensive experiments show that such a simple scheme surpasses previous tuning baselines on both discriminative and generative tasks.
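The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration: the layer sizes, the tensor layout of `alpha` and `D`, the helper name `build_filters`, and the random values standing in for a real pre-trained decomposition. The sketch only shows that each filter is a linear combination of a few spatial atoms, and that tuning the atoms alone touches far fewer parameters than the dense filter bank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes, chosen only for illustration.
c_out, c_in, k, m = 8, 4, 3, 6  # output channels, input channels, kernel size, atoms

# Filter atoms D: m small k x k spatial kernels (the tunable part).
D = rng.standard_normal((m, k, k))
# Atom coefficients alpha: channel-mixing weights, kept frozen during tuning.
# Random values stand in for coefficients obtained from a pre-trained model.
alpha = rng.standard_normal((c_out, c_in, m))


def build_filters(alpha, D):
    """Compose every k x k filter as a linear combination of the m atoms."""
    return np.einsum("oim,mxy->oixy", alpha, D)


F = build_filters(alpha, D)

# Simulated tuning step: only D is perturbed, alpha is reused unchanged.
D_tuned = D + 0.01 * rng.standard_normal(D.shape)
F_tuned = build_filters(alpha, D_tuned)

tunable = D.size             # m * k * k
frozen = alpha.size          # c_out * c_in * m
full = c_out * c_in * k * k  # parameters of the dense filter bank
print(F.shape, tunable, frozen, full)  # (8, 4, 3, 3) 54 192 288
```

With these toy sizes the tunable atoms hold 54 values against 288 in the dense filters. The number of atom parameters does not depend on the channel counts, so the gap widens as `c_in` and `c_out` grow.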

Paper Structure

This paper contains 60 sections, 8 equations, 26 figures, 10 tables.

Figures (26)

  • Figure 1: Compared with LoRA (Hu et al., 2021), our method updates a small set of parameters and mitigates the risk of overfitting to the target concept. In this example task of learning the concept $\langle$castle$\rangle$ from 5 input images, our method fine-tunes only 0.75M parameters, a stark reduction from the 22.67M parameters required by LoRA. Furthermore, the model fine-tuned by our method captures the target concept while ensuring rich diversity and strong alignment with the input text prompts. This demonstrates that our approach preserves the generalization capability of large models. The text prompts used to generate images from left to right are: "The $\langle$castle$\rangle$ stands against a backdrop of snow-capped mountains", "A $\langle$castle$\rangle$ surrounded by a lush, vibrant forest", "The $\langle$castle$\rangle$ overlooks a serene lake", and "The $\langle$castle$\rangle$ in the autumn season with colorful foliage".
  • Figure 2: A filter subspace view of model tuning. (a) Each convolutional layer $\mathcal{F}$ is constructed as linear combinations of filter subspace elements, i.e., filter atoms $\mathbf{D}$, using decomposition coefficients, i.e., atom coefficients $\boldsymbol{\alpha}$. Hypothesizing that variations across tasks can be reduced by bridging spatial discrepancies in the images, we propose to calibrate the pre-trained model by fine-tuning only the spatial-only $\mathbf{D}$ while keeping the spatially-invariant channel weights $\boldsymbol{\alpha}$ fixed. (b) The convolution operation is represented in two stages: at the spatial-only convolution stage, the filter atoms $\mathbf{D}$ are adapted to the target task, and then at the cross-channel mixing stage, the features $\mathbf{Z}'$ are combined into $\mathbf{Z}$ using the fixed atom coefficients $\boldsymbol{\alpha}$, which are obtained from the pre-trained model. More details are provided in the method section.
  • Figure 3: (a) The process of constructing overcomplete filter atoms: each filter atom in $\mathbf{D}$ is represented as a linear combination of $m_1$ filter atoms in $\mathbf{D}_1$. (b) The convolution operation is represented in three stages: at the spatial-only convolution stage, the set of overcomplete filter atoms $\mathbf{D}_{1}$ is adapted to the target task and generates $m\cdot m_1$ feature channels by convolving with each input channel. At the intra-channel mixing stage, every group of $m_1$ features is linearly combined by coefficients $\boldsymbol{\beta}$ with no cross-channel mixing. At the cross-channel mixing stage, the features $\mathbf{Z}'$ are combined into $\mathbf{Z}$ using the fixed atom coefficients $\boldsymbol{\alpha}$, which are obtained from the pre-trained model.
  • Figure 4: Formulate the linear layer $\mathbf{W}$ as a combination of atoms, $\mathbf{W} = \boldsymbol{\alpha}_c \times \mathbf{D}_c$.
  • Figure 5: Fine-tuning Stable Diffusion (Rombach et al., 2022) to learn the concept $\langle$castle$\rangle$ from (a) and generating images using the text prompt: "A peacock in front of the $\langle$castle$\rangle$". (b) Images generated by the pre-trained model, without any fine-tuning, align well with the text prompt but fail to match the target concept. (d) When only the filter atoms $\mathbf{D}$ are fine-tuned, our approach achieves good alignment with both the text prompt and the target concept. (h) However, fine-tuning the spatially-invariant channel weights, i.e., atom coefficients $\boldsymbol{\alpha}$, results in generated images that align only with the target concept, compromising the ability of the pre-trained model to produce diverse images that align with the text prompt. This issue is also observed in (e-g).
  • ...and 21 more figures
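
The staged view of convolution in the captions of Figures 2 and 3 can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the sizes, the tensor layouts, the helper `conv2d_single`, and the random values are all assumptions made for illustration. It verifies that spatial-only convolution with the overcomplete atoms $\mathbf{D}_1$, followed by intra-channel mixing with $\boldsymbol{\beta}$ and cross-channel mixing with $\boldsymbol{\alpha}$, gives the same output as an ordinary convolution with the composed filters.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes, chosen only for illustration.
c_out, c_in, k, m, m1 = 5, 3, 3, 4, 2
H = W = 6


def conv2d_single(x, kernel):
    """Valid-mode 2D cross-correlation of one channel with one kernel."""
    kh, kw = kernel.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out


X = rng.standard_normal((c_in, H, W))          # input feature map
alpha = rng.standard_normal((c_out, c_in, m))  # frozen atom coefficients
beta = rng.standard_normal((m, m1))            # intra-channel mixing coefficients
D1 = rng.standard_normal((m, m1, k, k))        # overcomplete atoms, m1 per atom of D

# Each atom in D is a linear combination of its m1 atoms in D1.
D = np.einsum("mn,mnxy->mxy", beta, D1)

# Reference: ordinary convolution with the composed filters F = alpha x D.
F = np.einsum("oim,mxy->oixy", alpha, D)
Z_ref = np.stack([
    sum(conv2d_single(X[i], F[o, i]) for i in range(c_in))
    for o in range(c_out)
])

# Stage 1 (spatial only): convolve every input channel with every atom in D1,
# giving m * m1 feature maps per input channel.
S = np.array([[[conv2d_single(X[i], D1[a, b]) for b in range(m1)]
               for a in range(m)] for i in range(c_in)])
# Stage 2 (intra-channel mixing): combine each group of m1 maps using beta.
Z_prime = np.einsum("mn,imnxy->imxy", beta, S)
# Stage 3 (cross-channel mixing): combine across atoms and input channels using alpha.
Z = np.einsum("oim,imxy->oxy", alpha, Z_prime)

print(np.allclose(Z, Z_ref))  # True, because every stage is linear
```

Dropping the second stage and convolving directly with `D` gives the two-stage form of Figure 2(b); in both forms `alpha` enters only in the final mixing step, which is why it can stay frozen while the atoms are tuned.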