ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts

Sinan Du, Guosheng Zhang, Keyao Wang, Yuanrui Wang, Haixiao Yue, Gang Zhang, Errui Ding, Jingdong Wang, Zhengzhuo Xu, Chun Yuan

TL;DR

ALoRE addresses efficient adaptation of large vision models by aggregating multiple low-rank experts in a Kronecker-product-based hypercomplex space, forming a multi-branch architecture that decouples learned patterns while keeping parameter growth negligible. Because the method uses purely linear transformations, it can be merged into the backbone after training through sequential re-parameterization, adding no inference latency. Across 24 downstream tasks and multiple backbones, ALoRE consistently outperforms full fine-tuning and existing PETL methods with minimal trainable parameters, and ablations confirm the importance of bottleneck size, expert count, and placement. Visualizations indicate that different experts specialize in complementary visual cues, which supports the claim that the experts decouple features. The result is a PETL method suited to multi-task learning and to resource-constrained deployment of large vision models.

Abstract

Parameter-efficient transfer learning (PETL) has become a promising paradigm for adapting large-scale vision foundation models to downstream tasks. Typical methods exploit an intrinsic low-rank property to decompose the task-specific weights, which compresses the parameter count. However, such approaches predominantly operate within the original feature space using a single-branch structure, which might be suboptimal for decoupling the learned representations and patterns. In this paper, we propose ALoRE, a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to Aggregate Low Rank Experts using a multi-branch paradigm, disentangling the learned cognitive patterns during training. Thanks to this design, ALoRE adds negligible extra parameters and can be merged into the frozen backbone via sequential re-parameterization, avoiding additional inference latency. We conduct extensive experiments on 24 image classification tasks using various backbone variants. Experimental results demonstrate that ALoRE outperforms the full fine-tuning strategy and other state-of-the-art PETL methods in terms of performance and parameter efficiency. For instance, ALoRE obtains improvements of 3.06% and 9.97% in average Top-1 accuracy over full fine-tuning on the FGVC datasets and the VTAB-1k benchmark, respectively, while updating only 0.15M parameters.
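To make the idea of aggregating low-rank experts in a Kronecker-product space more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the exact parameterization, the function name aggregate_experts, and the sizes d, n, and r are all assumptions chosen for illustration. Each hypothetical expert i contributes a Kronecker product of a small n x n matrix A[i] with a low-rank factorization U[i] @ V[i], and the experts are summed into one d x d weight.

```python
import numpy as np


def aggregate_experts(A, U, V):
    """Sum Kronecker products A[i] (x) (U[i] @ V[i]) over all experts.

    Assumed shapes (illustrative, not taken from the paper):
      A: (n, n, n)      one small n x n mixing matrix per expert
      U: (n, d // n, r) down-projection factor per expert
      V: (n, r, d // n) up-projection factor per expert
    Returns a dense (d, d) weight matrix.
    """
    return sum(np.kron(A[i], U[i] @ V[i]) for i in range(A.shape[0]))


rng = np.random.default_rng(0)
d, n, r = 768, 4, 1  # hidden size, number of experts, per-expert rank (assumed)

A = rng.normal(size=(n, n, n))
U = rng.normal(size=(n, d // n, r))
V = rng.normal(size=(n, r, d // n))

W = aggregate_experts(A, U, V)

# Trainable parameters of this sketch versus a dense d x d update.
trainable = A.size + U.size + V.size
dense = d * d
print(W.shape, trainable, dense)
```

With these assumed sizes the sketch trains 1,600 values to parameterize a 768 x 768 matrix that would otherwise hold 589,824 entries. These numbers only illustrate why such a parameterization is cheap; they are not the paper's reported parameter counts.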

Paper Structure

This paper contains 35 sections, 8 equations, 7 figures, 24 tables.

Figures (7)

  • Figure 1: Performance comparison with existing PETL methods using the ViT-B/16 model on the VTAB-1k benchmark. Our ALoRE achieves state-of-the-art average Top-1 accuracy (%) across 19 downstream tasks.
  • Figure 2: Comparison of performance and throughput during inference. Our ALoRE achieves the theoretically optimal throughput because it retains the desirable property of re-parameterization.
  • Figure 3: Illustration of the Aggregated Low Rank Experts method. The ALoRE block can be reduced to a simple linear transformation layer after training. Afterward, the reduced linear weights can be re-parameterized into the first projection layer of the nearest module.
  • Figure 4: Visualization of attention maps with respect to different experts. "Single-$i$" denotes the exclusive preservation of the $i$-th expert only. "Increment-4" represents the aggregation of all experts.
  • Figure 5: t-SNE visualizations of different fine-tuning methods on the VTAB-1k benchmark. We use the embeddings of [CLS] after the last transformer layer and before the classification head. All results are acquired using ViT-B/16 pre-trained on ImageNet-21K. Our ALoRE attains better manifolds and feature clustering results compared to other fine-tuning strategies.
  • ...and 2 more figures
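
The reduction described in the Figure 3 caption can be sketched numerically. The following is a hedged illustration rather than the paper's code: it assumes the trained block acts as a bias-free residual linear map (x + x @ W_adapt) placed immediately before a frozen projection, and every name and size here (W_adapt, W_proj, b_proj, d, d_out) is hypothetical. Under those assumptions, the adapter folds into the projection weight and the merged layer gives the same output with a single matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_out, tokens = 16, 32, 5  # illustrative sizes, not from the paper

# Stand-in for the trained adapter after it has been reduced to one linear map.
W_adapt = rng.normal(size=(d, d)) * 0.01
# Frozen first projection of the following module (assumed layout).
W_proj = rng.normal(size=(d, d_out))
b_proj = rng.normal(size=d_out)

x = rng.normal(size=(tokens, d))

# Training-time path: residual linear adapter, then the frozen projection.
y_train = (x + x @ W_adapt) @ W_proj + b_proj

# Inference-time path: fold (I + W_adapt) into the projection weight once.
W_merged = (np.eye(d) + W_adapt) @ W_proj
y_merged = x @ W_merged + b_proj

# Both paths agree up to floating-point error.
assert np.allclose(y_train, y_merged)
```

Because the merged weight has the same shape as the original projection, the deployed model under these assumptions has the same layer count and cost as the unmodified backbone, which is the property the Figure 2 caption refers to.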