KARST: Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission for Visual Classification

Yue Zhu, Haiwen Diao, Shang Gao, Long Chen, Huchuan Lu

TL;DR

KARST addresses two limitations of parameter-efficient fine-tuning (PEFT) for large vision models: limited representation capacity and misalignment with pre-trained intermediate features. It decomposes the weight update into a sum of $N$ Kronecker kernels, $\Delta{\mathbf{W}} = \sum_{i=1}^{N} \mathbf{C}_{i} \otimes (\mathbf{A}_{i} \mathbf{B}_{i})$, with Gaussian-initialized $\mathbf{A}_{i}$ and $\mathbf{C}_{i}$ and zero-initialized $\mathbf{B}_{i}$, so that $\Delta{\mathbf{W}} = \mathbf{0}$ at the start of training. It also adds channel-wise re-scaling factors $\mathbf{s_{1}}$ and $\mathbf{s_{2}}$ to better align with pre-trained feature distributions via $\mathbf{y} = (\mathbf{s_{1}}+\mathbf{1}) \odot (\mathbf{W_{0}}+\Delta{\mathbf{W}}) \mathbf{x} + \mathbf{s_{2}}$. This design expands the learning subspace, and because the adaptation can be merged back into the pre-trained weights, inference efficiency is preserved. Empirical results on VTAB-1K and few-shot benchmarks show that KARST outperforms existing PEFT approaches across multiple backbones (e.g., ViT-B/16 and Swin-B) with a comparable number of trainable parameters and negligible additional inference cost. The code is publicly available.
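The update rule in the TL;DR can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the shapes of $\mathbf{C}_{i}$ ($k \times k$), $\mathbf{A}_{i}$ ($d_{out}/k \times r$), and $\mathbf{B}_{i}$ ($r \times d_{in}/k$), the standard deviation 0.02, the zero initialization of $\mathbf{s_{1}}$ and $\mathbf{s_{2}}$, and all function names (`init_kernels`, `karst_delta`, `karst_forward`) are hypothetical choices made so that the Kronecker product matches the shape of $\mathbf{W_{0}}$.

```python
import numpy as np


def init_kernels(rng, d_out, d_in, n_kernels, k, r):
    """Create N kernels (C_i, A_i, B_i). Shapes and scale are assumptions."""
    assert d_out % k == 0 and d_in % k == 0
    # Gaussian-initialized C_i and A_i, zero-initialized B_i (as in the TL;DR).
    Cs = [rng.normal(scale=0.02, size=(k, k)) for _ in range(n_kernels)]
    As = [rng.normal(scale=0.02, size=(d_out // k, r)) for _ in range(n_kernels)]
    Bs = [np.zeros((r, d_in // k)) for _ in range(n_kernels)]
    return Cs, As, Bs


def karst_delta(Cs, As, Bs):
    """Delta W = sum_i C_i (kron) (A_i B_i), with shape (d_out, d_in)."""
    return sum(np.kron(C, A @ B) for C, A, B in zip(Cs, As, Bs))


def karst_forward(x, W0, Cs, As, Bs, s1, s2):
    """y = (s1 + 1) * ((W0 + Delta W) x) + s2, element-wise over channels."""
    W = W0 + karst_delta(Cs, As, Bs)
    return (s1 + 1.0) * (W @ x) + s2


rng = np.random.default_rng(0)
d_out, d_in = 8, 12
W0 = rng.normal(size=(d_out, d_in))   # stand-in for a frozen pre-trained weight
x = rng.normal(size=d_in)
Cs, As, Bs = init_kernels(rng, d_out, d_in, n_kernels=3, k=2, r=2)
s1 = np.zeros(d_out)                  # assumed zero init for the scale factor
s2 = np.zeros(d_out)                  # assumed zero init for the shift factor
y = karst_forward(x, W0, Cs, As, Bs, s1, s2)
```

Because every $\mathbf{B}_{i}$ starts at zero, $\Delta{\mathbf{W}}$ is zero at initialization. With the assumed zero-initialized re-scaling factors, the adapted layer then reproduces the pre-trained output $\mathbf{W_{0}}\mathbf{x}$ exactly before any training step.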

Abstract

Fine-tuning pre-trained vision models for specific tasks is a common practice in computer vision. However, this process becomes more expensive as models grow larger. Recently, parameter-efficient fine-tuning (PEFT) methods have emerged as a popular solution to improve training efficiency and reduce storage needs by tuning additional low-rank modules within pre-trained backbones. Despite their advantages, they struggle with limited representation capabilities and misalignment with pre-trained intermediate features. To address these issues, we introduce an innovative Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission (KARST) for various recognition tasks. Specifically, its multi-kernel design extends Kronecker projections horizontally and separates adaptation matrices into multiple complementary spaces, reducing parameter dependency and creating more compact subspaces. Besides, it incorporates extra learnable re-scaling factors to better align with pre-trained feature distributions, allowing for more flexible and balanced feature aggregation. Extensive experiments validate that our KARST outperforms other PEFT counterparts with a negligible inference cost due to its re-parameterization characteristics. Code is publicly available at: https://github.com/Lucenova/KARST.
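The re-parameterization property mentioned in the abstract follows from the TL;DR formula, since $(\mathbf{s_{1}}+\mathbf{1}) \odot (\mathbf{W}\mathbf{x}) = \mathrm{diag}(\mathbf{s_{1}}+\mathbf{1})\,\mathbf{W}\mathbf{x}$. The sketch below checks this numerically. It is a minimal illustration, not the authors' code: the helper name `merge_karst`, the kernel shapes, and the random values standing in for trained parameters are all assumptions, and the layer is assumed to have no bias of its own, as in the formula.

```python
import numpy as np


def merge_karst(W0, delta_W, s1, s2):
    """Fold the update and re-scaling into one weight matrix and one bias.

    Hypothetical helper: scales each output row of (W0 + delta_W) by (s1 + 1)
    and uses s2 as the bias of the merged linear layer.
    """
    W_merged = (s1 + 1.0)[:, None] * (W0 + delta_W)
    b_merged = s2
    return W_merged, b_merged


rng = np.random.default_rng(0)
d_out, d_in, k, r, n_kernels = 8, 12, 2, 2, 3
W0 = rng.normal(size=(d_out, d_in))   # stand-in for a frozen pre-trained weight

# Random values stand in for trained kernels C_i, A_i, B_i (assumed shapes).
delta_W = sum(
    np.kron(
        rng.normal(size=(k, k)),
        rng.normal(size=(d_out // k, r)) @ rng.normal(size=(r, d_in // k)),
    )
    for _ in range(n_kernels)
)
s1 = rng.normal(size=d_out)
s2 = rng.normal(size=d_out)
x = rng.normal(size=d_in)

# Training-time form: separate update and channel-wise re-scaling.
y_train = (s1 + 1.0) * ((W0 + delta_W) @ x) + s2

# Inference-time form: a single linear layer with merged weight and bias.
W_merged, b_merged = merge_karst(W0, delta_W, s1, s2)
y_merged = W_merged @ x + b_merged
```

After merging, the layer is an ordinary linear map of the same size as the original, so the adaptation adds no extra matrix multiplications at inference time.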

Paper Structure

This paper contains 11 sections, 4 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Framework of our proposed KARST. We first transform hidden features into $N$ kernel Kronecker products and then utilize shifting and scaling factors to align the merged outputs with the subsequent pre-trained layer.
  • Figure 2: Top-1 accuracy on fine-grained few-shot benchmark with ViT-B/16 as the backbone. Note that our KARST significantly outperforms other PETL competitors and consistently achieves the best results under the few-shot settings across the five fine-grained visual classification datasets.
  • Figure 3: Average accuracy of KARST on VTAB-1K with multiple kernels.