
CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Ahmed

Abstract

Foundation models have revolutionized AI, but adapting them efficiently for multimodal tasks, particularly in dual-stream architectures composed of unimodal encoders such as DINO and BERT, remains a significant challenge. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) enable lightweight adaptation, yet they operate in isolation within each modality, limiting their ability to capture cross-modal interactions. In this paper, we take a step toward bridging this gap with Cross-Modal Low-Rank Adaptation (CoLA), a novel PEFT framework that extends LoRA by introducing a dedicated inter-modal adaptation pathway alongside the standard intra-modal one. This dual-path design enables CoLA to adapt unimodal foundation models to multimodal tasks effectively, without interference between modality-specific and cross-modal learning. We evaluate CoLA across a range of vision-language (RefCOCO, RefCOCO+, RefCOCOg) and audio-visual (AVE, AVS) benchmarks, where it consistently outperforms LoRA, achieving relative gains of around 3% and 2%, respectively, while maintaining parameter efficiency. Notably, CoLA enables the first multi-task PEFT framework for visual grounding, bridging a key gap in efficient multimodal adaptation.
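
To make the dual-path idea concrete, below is a minimal, hypothetical PyTorch sketch of a CoLA-style adapted linear layer: the frozen pre-trained weight keeps a standard intra-modal LoRA path, while an inter-modal path derives a dynamic low-rank factor from a pooled feature of the other modality via a small hypernetwork and gates it with a learned scale. The class name `CoLALinearSketch`, the rank, the pooling, and the exact hypernetwork form are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class CoLALinearSketch(nn.Module):
    """Hypothetical sketch of a CoLA-style adapted linear layer (assumed form):
    y = W0 x + (B_l A_l) x + lambda * (B_c A_c(h_cross)) x
    A frozen weight W0 gets an intra-modal LoRA path plus an inter-modal path
    whose low-rank factor is produced from cross-modal features."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, cross_dim: int = 768):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # frozen pre-trained weight W0

        d_out, d_in = base_linear.out_features, base_linear.in_features
        # Intra-modal pathway (standard LoRA): Delta W_L = B_l A_l
        self.A_l = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_l = nn.Parameter(torch.zeros(d_out, rank))
        # Inter-modal pathway: a small hypernetwork maps a pooled cross-modal
        # feature to a dynamic low-rank factor (assumed parameterization).
        self.hyper = nn.Linear(cross_dim, rank * d_in)
        self.B_c = nn.Parameter(torch.zeros(d_out, rank))
        self.lam = nn.Parameter(torch.zeros(1))  # learned cross-modal scale
        self.rank, self.d_in = rank, d_in

    def forward(self, x: torch.Tensor, cross_feat: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_in); cross_feat: (batch, cross_dim), e.g. a pooled
        # feature from the other modality's encoder.
        intra = (x @ self.A_l.T) @ self.B_l.T
        A_c = self.hyper(cross_feat).view(-1, self.rank, self.d_in)  # dynamic A
        inter = torch.einsum("btd,brd->btr", x, A_c) @ self.B_c.T
        return self.base(x) + intra + self.lam * inter


# Usage with assumed sizes: a 768-d layer adapted with cross-modal features of dim 512.
layer = CoLALinearSketch(nn.Linear(768, 768), rank=8, cross_dim=512)
y = layer(torch.randn(2, 50, 768), torch.randn(2, 512))
```

Zero-initializing the B matrices and the scale keeps the adapted layer identical to the frozen one at the start of training, mirroring the usual LoRA initialization so the two pathways can be learned without disturbing the pre-trained behavior.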

Paper Structure

This paper contains 37 sections, 6 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of LoRA and CoLA in dual-encoder architectures for multimodal tasks. (a) LoRA applies independent low-rank adaptation within each modality without cross-modal interaction. (b) CoLA enables cross-modal interaction through inter-modal fusion pathways, allowing information exchange between Modality 1 and Modality 2 during the low-rank adaptation process. Modality 1 and Modality 2 can be vision, language, or audio. The multimodal tasks include vision-language (REC and RES) and audio-visual (AVE and AVS) downstream tasks.
  • Figure 2: (Left) The overall architecture of CoLA applied to pre-trained linear components $W_0$ in transformer blocks, with the intra-modal pathway $\Delta W_{L}$ and the inter-modal fusion pathway $\Delta W_{C}$ of the CoLA formulation, which integrates dynamic weights from cross-modal features via a hypernetwork. (Right) Illustration of the progressive cross-modal propagation between dual encoders, transferring cross-modal features to the linear components adapted with CoLA in self-attention (SA: $W_{qkv}$), output projection (OUT: $W_o$), and the FFN module's up-projection (UP: $W_{up}$) and down-projection (DOWN: $W_{down}$).
  • Figure 3: Visualization of learned scaling factors $\lambda$ across transformer layers for different components ($W_q$, $W_k$, $W_v$, $W_o$, $W_{up}$, $W_{down}$) in dual-encoder architectures. The plots show how cross-modal interaction strength varies by layer depth and component type for vision-language and audio-visual tasks, with higher $\lambda$ values indicating stronger cross-modal influence.
  • Figure 4: Illustration of different sharing strategies for CoLA low-rank matrices between pathways: (a) Fully shared, (b) Partially shared A, (c) Partially shared B, (d) Fully non-shared.
  • Figure 5: Comparison of cross-modal propagation strategies: (a) uniform, (b) module-wise, and (c) progressive designs. A minimal sketch of the progressive scheme appears after this list.
  • ...and 1 more figure
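
The sketch below illustrates, under assumed names and shapes, how the progressive cross-modal propagation of Figure 5(c) might be wired between two encoder stacks: at each depth, every block receives a pooled summary of the other encoder's same-layer hidden state, gated by a learned scaling factor analogous to the $\lambda$ of Figure 3. The `DummyBlock`, the mean-pooling, and the matching hidden sizes are illustrative assumptions; in the paper the cross-modal feature feeds the CoLA pathways of $W_{qkv}$, $W_o$, $W_{up}$, and $W_{down}$ rather than being added directly.

```python
import torch
import torch.nn as nn


class DummyBlock(nn.Module):
    """Stand-in for a transformer block whose linear components would be wrapped
    with CoLA adapters; `cross_feat` is the feature their inter-modal paths consume."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.lam = nn.Parameter(torch.zeros(1))  # learned cross-modal scale (cf. Figure 3)

    def forward(self, x: torch.Tensor, cross_feat: torch.Tensor) -> torch.Tensor:
        # Placeholder update: a real block would route cross_feat into the
        # inter-modal low-rank pathways of its W_qkv, W_o, W_up, and W_down.
        return self.proj(x) + self.lam * cross_feat.unsqueeze(1)


def progressive_forward(blocks_a, blocks_b, x_a, x_b):
    """Run both encoders layer by layer, exchanging pooled features at each depth."""
    for blk_a, blk_b in zip(blocks_a, blocks_b):
        summary_a = x_a.mean(dim=1)  # pooled summary of modality A (assumed pooling)
        summary_b = x_b.mean(dim=1)  # pooled summary of modality B
        x_a = blk_a(x_a, summary_b)  # modality A conditioned on B, and vice versa
        x_b = blk_b(x_b, summary_a)
    return x_a, x_b


# Usage with two 4-layer dummy encoders and a shared hidden size (assumed).
dim = 32
enc_a = nn.ModuleList(DummyBlock(dim) for _ in range(4))
enc_b = nn.ModuleList(DummyBlock(dim) for _ in range(4))
out_a, out_b = progressive_forward(
    enc_a, enc_b, torch.randn(2, 10, dim), torch.randn(2, 7, dim)
)
```

Exchanging only pooled per-layer summaries, rather than full token sequences, keeps the added parameters and compute small, which is consistent with the parameter-efficiency goal of the method.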