iConFormer: Dynamic Parameter-Efficient Tuning with Input-Conditioned Adaptation

Hayeon Jo; Hyesong Choi; Minhee Cho; Dongbo Min

TL;DR

This paper proposes iConFormer (input-Conditioned transFormer), a novel PEFT approach that leverages a dynamic adapter conditioned on the input instances. It achieves performance comparable to full fine-tuning (FFT) in monocular depth estimation and semantic segmentation, while outperforming FFT in image classification and instance segmentation.

Abstract

Transfer learning based on full fine-tuning (FFT) of the pre-trained encoder and task-specific decoder becomes increasingly complex as deep models grow exponentially. Parameter-efficient fine-tuning (PEFT) approaches using adapters consisting of small learnable layers have emerged as an alternative to FFT, achieving comparable performance while maintaining high training efficiency. However, the inflexibility of the adapter with respect to input instances limits its capability of learning task-specific information in diverse downstream tasks. In this paper, we propose a novel PEFT approach, input-Conditioned transFormer, termed iConFormer, that leverages a dynamic adapter conditioned on the input instances. To secure flexible learning ability on input instances in various downstream tasks, we introduce an input-Conditioned Network (iCoN) in the dynamic adapter that enables instance-level feature transformation. Specifically, iCoN generates channel-wise convolutional kernels for each feature and transforms it using an adaptive convolution process to effectively capture task-specific and fine-grained details tailored to downstream tasks. Experimental results demonstrate that by tuning just 1.6% to 2.8% of the Transformer backbone parameters, iConFormer achieves performance comparable to FFT in monocular depth estimation and semantic segmentation, while outperforming it in image classification and instance segmentation. The proposed method also consistently outperforms recent PEFT methods on all of the tasks mentioned above.
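
To make the idea of input-conditioned, channel-wise kernels more concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a kernel generator made of global average pooling followed by a single linear map; the abstract does not specify the generator's architecture, so that choice, the weight layout, and all function names here (generate_kernels, channelwise_conv, input_conditioned_transform, center_tap_weight) are hypothetical.

```python
# Sketch only: the pooling-plus-linear kernel generator is an assumption,
# not the iCoN architecture described in the paper.

def global_avg_pool(x):
    """Average each channel of x (shape [C][H][W]) down to one scalar."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def generate_kernels(x, weight, k=3):
    """Map the pooled descriptor of x to one k x k kernel per channel.

    weight has shape [C][k*k][C] and stands in for learned parameters
    (hypothetical layout).
    """
    pooled = global_avg_pool(x)
    kernels = []
    for w_c in weight:
        flat = [sum(w * p for w, p in zip(row, pooled)) for row in w_c]
        kernels.append([flat[r * k:(r + 1) * k] for r in range(k)])
    return kernels


def channelwise_conv(x, kernels):
    """Filter each channel with its own kernel (zero padding, same size).

    As in most deep learning frameworks, this is cross-correlation.
    """
    k = len(kernels[0])
    pad = k // 2
    out = []
    for ch, ker in zip(x, kernels):
        h, w = len(ch), len(ch[0])
        out_ch = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += ker[di][dj] * ch[ii][jj]
                out_ch[i][j] = acc
        out.append(out_ch)
    return out


def input_conditioned_transform(x, weight, k=3):
    """Generate kernels from x itself, then apply them to x."""
    return channelwise_conv(x, generate_kernels(x, weight, k))


def center_tap_weight(num_channels, k=3):
    """Toy weights: channel c's kernel center equals its own pooled mean."""
    weight = [[[0.0] * num_channels for _ in range(k * k)]
              for _ in range(num_channels)]
    for c in range(num_channels):
        weight[c][(k * k) // 2][c] = 1.0
    return weight
```

The point of the sketch is that the same weights yield different kernels for different inputs, which is what separates a dynamic adapter from a static one whose filters are fixed after training.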

Paper Structure

This paper contains 26 sections, 7 equations, 5 figures, 7 tables.

Introduction
Related Work
Transformer in Vision
Parameter Efficient Fine Tuning
Preliminary
Vision Transformer and its Variants
PEFT Methods
Proposed Method
Motivation and Overview
Input-Conditioned Network (iCoN)
Visual Analysis of Local Representation
Experiments
Experimental Settings
Datasets and Downstream Tasks
Pretrained Backbones
...and 11 more sections

Figures (5)

Figure 1: Quantitative comparison with full fine-tuning (FFT) and PEFT approaches. The top graph compares depth prediction errors on NYU-v2, while the bottom graph presents performance for semantic segmentation on ADE20K and instance segmentation on COCO. iConFormer consistently outperforms recent PEFT methods in all dense tasks and surpasses FFT in instance segmentation.
Figure 2: Comparison of Full Fine-Tuning (FFT) and the proposed Parameter Efficient Fine-Tuning (PEFT) using iConFormer. (a) FFT, where all parameters are updated during training. (b) Our PEFT (iConFormer), where a dynamic adapter is attached sequentially after the MLP layer in the Transformer. Inside the dynamic adapter, an Input-Conditioned Network (iCoN) generates input-conditioned convolutional kernels on a per-channel basis, as detailed in Figure 4. By convolving features with these kernels, iCoN adaptively refines them in accordance with the specific properties of the input, thereby enhancing the model’s capability to effectively process diverse input data in the downstream tasks.
Figure 3: Illustration of the sequential and parallel configurations. The sequential design is shown on the left, and the parallel design on the right.
Figure 4: Architecture of input-Conditioned Network (iCoN). The down-projected feature map $\hat{x}$ is used to dynamically generate channel-wise convolution kernels through the iCoN.
Figure 5: Comparison of attention maps from AdaptFormer and iConFormer. We visualize the attention maps using attention rollout. The top row shows the input images, and the middle and bottom rows present the attention maps generated by AdaptFormer and iConFormer, respectively, both using a Swin Transformer backbone. iConFormer delineates object regions more accurately and captures finer-grained semantics than AdaptFormer.
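
The sequential and parallel configurations named in Figure 3 can be sketched roughly as below. This is an illustrative assumption rather than the paper's exact wiring: the residual connections follow the common adapter convention, layer normalization is omitted, and the function names (sequential_block, parallel_block) and the toy mlp and adapter callables are hypothetical.

```python
# Sketch only: residual placement is assumed, and normalization is left out.

def add(a, b):
    """Element-wise sum of two equal-length feature vectors."""
    return [u + v for u, v in zip(a, b)]


def sequential_block(x, mlp, adapter):
    """Adapter placed after the MLP: it refines the MLP branch output."""
    h = add(x, mlp(x))          # frozen MLP with its residual connection
    return add(h, adapter(h))   # trainable adapter sees the MLP result


def parallel_block(x, mlp, adapter):
    """Adapter placed beside the MLP: both branches read the same input."""
    return add(add(x, mlp(x)), adapter(x))
```

In the sequential form the adapter's input already contains the MLP output, whereas in the parallel form the two branches are independent and only meet at the sum.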
