Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks

Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens

TL;DR

This work tackles the challenge of aligning vision and language representations within parameter-efficient fine-tuning (PEFT) using low-rank bottlenecks. It introduces routing functions (linear, parameter-free operations embedded inside the bottleneck) to steer VL alignment between the down-projected features of the text and visual inputs. Across encoder-only, decoder-only, and encoder-decoder models (RoBERTa, GPT-2, CLIP-BART, with ViT backbones), routing functions consistently improve VL tasks such as VQAv2 and COCO Captioning, often outperforming cross-attention with similar parameter budgets. The approach preserves efficiency while delivering substantial gains, and the authors provide extensive experiments, ablations, and a public code release to facilitate future exploration of routing-based VL PEFT.

Abstract

Mainstream parameter-efficient fine-tuning (PEFT) methods, such as LoRA or Adapter, project a model's hidden states to a lower dimension, allowing pre-trained models to adapt to new data through this low-rank bottleneck. However, PEFT tasks involving multiple modalities, like vision-language (VL) tasks, require not only adaptation to new data but also learning the relationship between different modalities. Targeting VL PEFT tasks, we propose a family of operations, called routing functions, to enhance VL alignment in the low-rank bottlenecks. These feature routing functions adopt linear operations and do not introduce new trainable parameters. In-depth analyses are conducted to study their behavior. In various VL PEFT settings, the routing functions significantly improve the performance of the original PEFT methods, achieving over 20% improvement on VQAv2 ($\text{RoBERTa}_{\text{large}}$+ViT-L/16) and 30% on COCO Captioning (GPT2-medium+ViT-L/16). When fine-tuning a pre-trained multimodal model such as CLIP-BART, we also observe smaller but consistent improvements across a range of VL PEFT tasks. Our code is available at https://github.com/tingyu215/Routing_VLPEFT.
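
To make the "low-rank bottleneck" in the abstract concrete, the following sketch counts the trainable weights of a single bottleneck. It is a back-of-the-envelope illustration, not taken from the paper: the hidden size d = 1024 (the hidden size of RoBERTa-large and GPT2-medium) and the rank r = 64 (the rank quoted in the figure captions) are assumed values, biases are ignored, and the function names are hypothetical.

```python
def bottleneck_params(d: int, r: int) -> int:
    """Trainable weights in one low-rank bottleneck.

    W_down has shape (r, d) and W_up has shape (d, r), so the total is 2*d*r.
    Biases are ignored in this sketch.
    """
    return 2 * d * r


def dense_params(d: int) -> int:
    """Weights in a full d x d update, shown for comparison."""
    return d * d


# Assumed sizes: d = 1024 (hidden size), r = 64 (bottleneck rank).
d, r = 1024, 64
low_rank = bottleneck_params(d, r)
full = dense_params(d)

print(low_rank)         # 131072
print(full)             # 1048576
print(low_rank / full)  # 0.125
```

Because the routing functions are described as parameter-free, adding one to such a bottleneck would leave these counts unchanged.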

Paper Structure (15 sections, 8 figures, 17 tables)

Figures (8)

  • Figure 1: Illustration of LoRA and Adapter used in a Transformer block. Only the green modules are updated in PEFT. We identify the low-rank bottleneck with the orange rectangle. The Adapters are sequentially added to the Transformer block.
  • Figure 2: Illustration of our method. Left: Conventional PEFT methods with a low-rank bottleneck first map the hidden states $x_H$ from a high dimension $d$ to a lower dimension $r$ via the down-projection map $W_{down}$, and then map them back to the original dimension $d$ via $W_{up}$. Middle: Our method utilizes the same architecture, but updates the features via a routing function in the low-rank bottleneck. Specifically, in a VL task, we use $W_{down}$ to down-project the features $x_R$ that we want to align to, giving $W_{down}x_R$. The routing function then routes $W_{down}x_R$ and $W_{down}x_H$ in the low-rank bottleneck. Right: Different routing functions. In Adapter, routing functions are added before the nonlinear activation functions.
  • Figure 3: Pipeline for VQA and Image Captioning. Only the added classifier (in VQA) and the PEFT modules of the RoBERTa encoder and GPT decoder are tuned. Note that here $x_H$ denotes the text inputs with the visual [CLS] features prepended.
  • Figure 4: Example of average attention weights from the last layer of GPT2. We use the final checkpoint trained on COCO Cap. with $r=64$. IMAGE: visual [CLS] feature.
  • Figure 5: Qualitative examples from VQA and Image Captioning tasks. We show cases where the model trained with conventional LoRA fails, alongside the results obtained from a model trained with LoRA plus a routing function. We use $r=64$.
  • ...and 3 more figures
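
The bottleneck described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not list the individual routing functions, so the element-wise product used here is only one plausible example of a linear, parameter-free operation. The function names, the toy dimensions, and the choice of a single vector for $x_R$ (broadcast over all tokens) are likewise assumptions. Only the low-rank update is returned; in LoRA it would be added to the output of the frozen layer.

```python
import numpy as np


def bottleneck_update(x_H, x_R, W_down, W_up, route=None):
    """Low-rank update with an optional routing step (illustrative sketch).

    x_H:    (n, d) hidden states of n tokens
    x_R:    (d,)   feature to align to (assumed here: one vector, e.g. a visual [CLS] feature)
    W_down: (r, d) down-projection
    W_up:   (d, r) up-projection
    route:  callable combining the two down-projected features, or None
    """
    h = x_H @ W_down.T      # (n, r): W_down x_H for every token
    if route is not None:
        g = W_down @ x_R    # (r,): the same W_down projects x_R
        h = route(h, g)     # routing happens inside the bottleneck
    return h @ W_up.T       # (n, d): back to the original dimension


def elementwise_product(h, g):
    """Hypothetical routing function: linear in h, no trainable parameters."""
    return h * g            # g is broadcast across the n tokens


# Toy sizes and random weights, chosen only so the example runs.
rng = np.random.default_rng(0)
n, d, r = 3, 16, 4
W_down = rng.normal(size=(r, d))
W_up = rng.normal(size=(d, r))
x_H = rng.normal(size=(n, d))
x_R = rng.normal(size=(d,))

plain = bottleneck_update(x_H, x_R, W_down, W_up)
routed = bottleneck_update(x_H, x_R, W_down, W_up, route=elementwise_product)
print(plain.shape, routed.shape)
```

With this choice of routing function, the routed update equals $W_{up}\,\mathrm{diag}(W_{down}x_R)\,W_{down}x_H$, which stays linear in $x_H$ and reuses the existing projection weights, so no parameters are added.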