M2IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension

Xuyang Liu, Ting Liu, Siteng Huang, Yi Xin, Yue Hu, Quanjun Yin, Donglin Wang, Yuanyuan Wu, Honggang Chen

TL;DR

Referring expression comprehension (REC) requires strong cross-modal grounding, but fully fine-tuning vision-language foundation models is costly. The authors propose M2IST, a Multi-Modal Interactive Side-Tuning framework that freezes the heavy Vision and Language Encoders and tunes only a Mixture of Multi-Modal Interactive Side-Adapters (M3ISAs). M3ISAs combine intra-modality adapters (VEA, LEA), which adapt the pre-trained single-modality representations to the REC domain, with an inter-modality adapter (IEA), which bridges the two encoders to enable cross-modal interaction. Compared with full fine-tuning, M2IST uses about $2.11\%$ of the tunable parameters, $39.61\%$ of the GPU memory, and $63.46\%$ of the fine-tuning time, while maintaining competitive accuracy on RefCOCO, RefCOCO+, and RefCOCOg and generalizing to phrase grounding on ReferItGame. Bounding-box predictions are trained with the objective $L = L_{\text{smooth-l1}} + \lambda L_{\text{GIoU}}$. The approach offers a practical path for adapting powerful vision-language models to REC and related tasks with reduced computational demands, and the code is released to facilitate adoption.
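
To make the training objective $L = L_{\text{smooth-l1}} + \lambda L_{\text{GIoU}}$ concrete, here is a minimal pure-Python sketch for a single predicted box and a single ground-truth box. It is an illustration under stated assumptions, not the authors' implementation: the `(x1, y1, x2, y2)` box format, the averaging over the four coordinates, the smooth-L1 threshold `beta`, the default `lam=1.0`, and all function names are hypothetical choices that this summary does not specify.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber-style) loss, averaged over the box coordinates.

    Assumption: mean reduction and beta=1.0; the summary does not state either.
    """
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def giou(a, b):
    """Generalized IoU of two boxes in (x1, y1, x2, y2) format.

    Assumption: both boxes have positive area, so no division by zero occurs.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (width/height clamp to zero when boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    return inter / union - (hull - union) / hull


def rec_box_loss(pred, target, lam=1.0):
    """L = L_smooth-l1 + lam * L_GIoU, with L_GIoU = 1 - GIoU."""
    return smooth_l1(pred, target) + lam * (1.0 - giou(pred, target))
```

For example, `rec_box_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))` is zero for a perfect prediction. The GIoU term keeps growing as disjoint boxes move further apart, whereas plain IoU would stay at zero.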

Abstract

Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained vision-language foundation models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, directly applying PETL to REC faces two challenges: (1) insufficient multi-modal interaction between pre-trained vision-language foundation models, and (2) high GPU memory usage due to gradients passing through the heavy vision-language foundation models. To this end, we present M2IST: Multi-Modal Interactive Side-Tuning with M3ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we fix the pre-trained uni-modal encoders and update M3ISAs to enable efficient vision-language alignment for REC. Empirical results reveal that M2IST achieves a better performance-efficiency trade-off than full fine-tuning and other PETL methods, requiring only 2.11% of the tunable parameters, 39.61% of the GPU memory, and 63.46% of the training time while maintaining competitive performance. Our code is released at https://github.com/xuyang-liu16/M2IST.

Paper Structure

This paper contains 39 sections, 6 equations, 4 figures, 11 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Comparison of (a) fully fine-tuning, (b) Adapter-tuning, and (c) our M2IST for REC. By only using 2.11% of the tunable parameters, 39.61% of the GPU memory, and 63.46% of the fine-tuning time, M2IST achieves comparable or even superior performance compared to fully fine-tuning.
  • Figure 2: Overall architecture of M2IST. M2IST freezes the pre-trained Vision Encoder (blue branch) and Language Encoder (green branch), while updating M3ISAs on side networks (pink branch). M3ISAs comprise IEA for bridging the pre-trained dual encoders to enable cross-modality interactions, and VEA/LEA for transferring pre-trained single-modality representations to adapt to the REC domain. By avoiding backpropagation through the heavy encoders (red dashed arrow), M2IST enables parameter-, memory-, and time-efficient tuning for the task of referring expression comprehension.
  • Figure 3: Different adapter insertion forms. During fine-tuning, gradients in (a) and (b) backpropagate through the heavy encoders, while gradients in (c) only backpropagate through the lightweight adapters, achieving memory-efficient tuning for REC. Note that (b) and (c) only illustrate the vision branch for simplicity.
  • Figure 4: Visualizations of attention maps from the V-L Encoder with different mixing strategies. Cases include object appearance attributes (blue words), human actions (green words), and spatial relations (red words).