Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks
Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens
TL;DR
This work tackles the challenge of aligning vision and language representations within parameter-efficient fine-tuning (PEFT) using low-rank bottlenecks. It introduces routing functions (linear, parameter-free operations embedded inside the bottleneck) to steer VL alignment between the down-projected features of text and visual inputs. Across encoder-only, decoder-only, and encoder-decoder models (RoBERTa, GPT-2, CLIP-BART, with ViT backbones), routing functions consistently improve VL tasks such as VQAv2 and COCO Captioning, often outperforming cross-attention with similar parameter budgets. The approach preserves efficiency while delivering substantial gains, and the authors provide extensive experiments, ablations, and a public code release to facilitate future exploration of routing-based VL PEFT.
Abstract
Mainstream parameter-efficient fine-tuning (PEFT) methods, such as LoRA or Adapter, project a model's hidden states to a lower dimension, allowing pre-trained models to adapt to new data through this low-rank bottleneck. However, PEFT tasks involving multiple modalities, like vision-language (VL) tasks, require not only adaptation to new data but also learning the relationship between different modalities. Targeting VL PEFT tasks, we propose a family of operations, called routing functions, to enhance VL alignment in the low-rank bottlenecks. These feature routing functions adopt linear operations and do not introduce new trainable parameters. In-depth analyses are conducted to study their behavior. In various VL PEFT settings, the routing functions significantly improve the performance of the original PEFT methods, achieving over 20\% improvement on VQAv2 ($\text{RoBERTa}_{\text{large}}$+ViT-L/16) and 30\% on COCO Captioning (GPT2-medium+ViT-L/16). Also, when fine-tuning a pre-trained multimodal model such as CLIP-BART, we observe smaller but consistent improvements across a range of VL PEFT tasks. Our code is available at https://github.com/tingyu215/Routing_VLPEFT.
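To make the idea of a parameter-free operation inside a low-rank bottleneck concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function name `bottleneck_with_routing`, the use of a single pooled visual feature, the shared down-projection for both modalities, and the two routing choices shown (element-wise product and element-wise sum) are all assumptions made for illustration; the abstract states only that routing functions are linear and add no trainable parameters.

```python
import numpy as np

def bottleneck_with_routing(h_text, v_pooled, W_down, W_up, route="mul"):
    """Hypothetical low-rank update with a routing step in the bottleneck.

    h_text:   (seq_len, d) text hidden states
    v_pooled: (d,) one pooled visual feature (illustrative assumption)
    W_down:   (d, r) down-projection, W_up: (r, d) up-projection
    """
    t = h_text @ W_down      # text features in the bottleneck, (seq_len, r)
    v = v_pooled @ W_down    # visual feature, same projection (assumption), (r,)
    if route == "mul":
        routed = t * v       # element-wise product, broadcast over tokens
    elif route == "add":
        routed = t + v       # element-wise sum, broadcast over tokens
    else:
        raise ValueError(f"unknown routing function: {route}")
    # The routing step uses no weights of its own: only W_down and W_up train.
    return routed @ W_up     # back to model dimension, (seq_len, d)

# Toy usage with random inputs.
rng = np.random.default_rng(0)
d, r, seq_len = 8, 2, 5
delta = bottleneck_with_routing(
    rng.normal(size=(seq_len, d)),
    rng.normal(size=d),
    rng.normal(size=(d, r)),
    rng.normal(size=(r, d)),
)
print(delta.shape)
```

In this sketch the trainable parameter count is identical to that of the plain bottleneck, because the routing step only combines features that have already been projected.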
