Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing

Meng Lou, Stanley Yu, Yizhou Yu

TL;DR

AdaRoute tackles the challenge of parameter-efficient fine-tuning for dense vision tasks by introducing a shared expert center with a lightweight dynamic router that assembles input-dependent projection weights. This MoE-inspired approach enables low-rank, input-conditioned adaptation and promotes cross-layer feature interaction through a shared parameter pool. The method incorporates dynamic multi-scale spatial mixing via depthwise convolutions and a spatial aggregation module, improving representation capacity without excessive parameter growth. Empirical results across semantic segmentation, object detection/instance segmentation, panoptic segmentation, and image classification show AdaRoute achieving state-of-the-art performance among PEFT methods and, in several cases, matching or surpassing full fine-tuning with only a small fraction of trainable parameters, highlighting its practical value for scalable model adaptation.

Abstract

Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose AdaRoute, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each AdaRoute module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in AdaRoute modules facilitate low-rank adaptation in an input-dependent manner, thus generating more customized and powerful feature representations. Moreover, since AdaRoute modules across multiple network layers share the same expert center, they improve feature diversity by promoting implicit cross-layer feature interaction. Extensive experiments demonstrate the superiority of AdaRoute on diverse vision tasks, including semantic segmentation, object detection and instance segmentation, and panoptic segmentation. Code will be available at: https://bit.ly/3NZcr0H.
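The routing idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. The softmax-weighted sum of experts, the mean-pooled routing descriptor, the ReLU bottleneck, the zero initialization of the expanding experts, and all names (`ExpertCenter`, `AdaRouteSketch`, `router_down`, `router_up`) are assumptions made for illustration. The abstract only states that each module aggregates parameter matrices from a shared expert center to form input-dependent low-rank weights.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class ExpertCenter:
    """Shared pool of expert matrices (hypothetical layout)."""

    def __init__(self, num_experts, dim, rank, rng):
        # Experts used to assemble the channel-reducing matrix W1.
        self.down = rng.normal(0.0, 0.02, (num_experts, dim, rank))
        # Experts used to assemble the channel-expanding matrix W2.
        # Zero init (an assumption) makes the adapter start as identity.
        self.up = np.zeros((num_experts, rank, dim))


class AdaRouteSketch:
    """Adapter whose low-rank weights are mixed from a shared center."""

    def __init__(self, center, dim, rng):
        self.center = center
        n = center.down.shape[0]
        # Lightweight per-module routers, one for each projection.
        self.router_down = rng.normal(0.0, 0.02, (dim, n))
        self.router_up = rng.normal(0.0, 0.02, (dim, n))

    def __call__(self, x):
        # x: (tokens, dim). Mean pooling gives one descriptor per input.
        ctx = x.mean(axis=0)
        g1 = softmax(ctx @ self.router_down)  # (num_experts,)
        g2 = softmax(ctx @ self.router_up)    # (num_experts,)
        # Weighted sums over the expert axis build input-dependent weights.
        w1 = np.tensordot(g1, self.center.down, axes=1)  # (dim, rank)
        w2 = np.tensordot(g2, self.center.up, axes=1)    # (rank, dim)
        h = np.maximum(x @ w1, 0.0)  # ReLU bottleneck (assumed)
        return x + h @ w2, (g1, g2)


rng = np.random.default_rng(0)
center = ExpertCenter(num_experts=4, dim=8, rank=2, rng=rng)
# Two modules at different layers draw from the same expert center.
layer_a = AdaRouteSketch(center, dim=8, rng=rng)
layer_b = AdaRouteSketch(center, dim=8, rng=rng)
x = rng.normal(size=(5, 8))
y, (g1, g2) = layer_a(x)
```

Because both modules hold a reference to the same `ExpertCenter`, a gradient step on the expert matrices would affect every layer that routes to them, which is one way to read the "implicit cross-layer feature interaction" described above.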

Paper Structure

This paper contains 28 sections, 5 equations, 5 figures, and 15 tables.

Figures (5)

  • Figure 1: (a) Classical adapter-based PEFT methods (e.g., Mona [yin2025Mona]). (b) Our proposed AdaRoute. Normalization layers and residual connections are omitted for simplicity. (c) The first and second rows show effective receptive field (ERF) and centered kernel alignment (CKA) visualizations for various fine-tuned models, respectively. A Swin-L model pre-trained on ImageNet-21K is used as the backbone network and is fine-tuned on COCO2017 with the Mask R-CNN framework using various fine-tuning methods. Quantitative results are listed in Table \ref{tab:det}.
  • Figure 2: An overview of our proposed AdaRoute. (a) denotes a hierarchical vision model equipped with shared expert centers and AdaRoute. (b) and (c) refer to the Swin- and ConvNeXt-style building blocks with AdaRoute, respectively.
  • Figure 3: The schematic workflow of dynamic parameter routing in AdaRoute.
  • Figure 4: A schematic diagram of dynamic multi-scale spatial mixing.
  • Figure 5: Expert activation maps in AdaRoute. Panels (a)-(d) are generated using 20, 40, 80, and 160 randomly selected images from the COCO2017 validation set, respectively. In each subfigure, the left and right heatmaps denote the expert activations for generating the channel-reducing matrix ($\mathbf{W}_1$) and the channel-expanding matrix ($\mathbf{W}_2$), respectively. The horizontal and vertical axes indicate the expert and the layer indices, respectively.
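One plausible way to build a layer-by-expert heatmap like those in Figure 5 is to average the routing weights over a set of images. The sketch below assumes this averaging, which the caption does not confirm. The function name `expert_activation_map` and the random stand-in weights are hypothetical, and no real AdaRoute outputs are used.

```python
import numpy as np


def expert_activation_map(gates):
    """Average routing weights over images.

    gates: array of shape (num_images, num_layers, num_experts).
    Returns an array of shape (num_layers, num_experts), so rows are
    layers and columns are experts, matching the axes in Figure 5.
    """
    gates = np.asarray(gates, dtype=float)
    if gates.ndim != 3:
        raise ValueError("expected shape (images, layers, experts)")
    return gates.mean(axis=0)


rng = np.random.default_rng(0)
# Stand-in routing weights for 20 images, 6 layers, 8 experts.
# These are random placeholders, not measured activations.
logits = rng.normal(size=(20, 6, 8))
gates = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
heatmap = expert_activation_map(gates)
```

Under this reading, one such map would be computed for the router that assembles $\mathbf{W}_1$ and another for the router that assembles $\mathbf{W}_2$, and the number of images would be varied to produce the four panels.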