Visual Grounding with Multi-modal Conditional Adaptation

Ruilin Yao, Shengwu Xiong, Yichen Zhao, Yi Rong

TL;DR

This paper introduces Multi-modal Conditional Adaptation (MMCA), which integrates information from different modalities into multi-modal embeddings and uses them to adaptively update the weights of the visual encoder, directing its focus towards text-relevant regions.

Abstract

Visual grounding is the task of locating objects specified by natural language expressions. Existing methods extend generic object detection frameworks to tackle this task. They typically extract visual and textual features separately using independent visual and textual encoders, then fuse these features in a multi-modal decoder for final prediction. However, visual grounding presents unique challenges. It often involves locating objects with different text descriptions within the same image. Existing methods struggle with this task because the independent visual encoder produces identical visual features for the same image regardless of the expression, limiting detection performance. Some recent approaches propose various language-guided visual encoders to address this issue, but they mostly rely solely on textual information and require sophisticated designs. In this paper, we introduce Multi-modal Conditional Adaptation (MMCA), which enables the visual encoder to adaptively update weights, directing its focus towards text-relevant regions. Specifically, we first integrate information from different modalities to obtain multi-modal embeddings. Then we utilize a set of weighting coefficients, which are generated from the multi-modal embeddings, to reorganize the weight update matrices and apply them to the visual encoder of the visual grounding model. Extensive experiments on four widely used datasets demonstrate that MMCA achieves significant improvements and state-of-the-art results. Ablation experiments further demonstrate that our method is lightweight and efficient. Our source code is available at: https://github.com/Mr-Bigworth/MMCA.

Paper Structure (16 sections, 13 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: (a) Traditional visual grounding framework with independent visual encoder. (b) Our proposed visual grounding framework with Multi-modal (MM) conditional visual encoder. We visualize the ground truth and the attention maps of various visual encoders. The attention distribution of the independent visual encoder appears more diffuse, whereas the attention distributions of the MM-conditional visual encoder are more concentrated on the corresponding object.
  • Figure 2: (a) The parameters or the inference pipeline of the visual encoder are dynamically modified according to the textual feature. (b) Integrating textual and visual features through finely designed attention modules. (c) LoRA uses the additional trainable low-rank parameter matrices to simulate weight updates in transfer learning. (d) MMCA utilizes multi-modal information to control a set of update matrices for the visual encoder to realize language-guided visual feature extraction.
  • Figure 3: Overview of our proposed Multi-modal Conditional Adaptation framework. We obtain a multi-modal embedding from visual and textual features and input it into different layers of the visual encoder to reorganize a set of weight updates for the visual encoder. The figure shows the conditional weight update for the self-attention layer (query and key) and convolution layer in the visual transformer and CNN backbone.
  • Figure 4: The gated fusion of visual and textual features.
  • Figure 5: Visualization of input images and referring expressions, the attention maps of the transformer encoder layer in TransVG and MMCA, our prediction results (red bounding boxes) and ground truth (yellow bounding boxes).
  • ...and 1 more figure
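
The Figure 4 caption refers to a gated fusion of visual and textual features, but the exact formulation is not given here. As a rough illustration only, the sketch below uses one common form of gated fusion: a sigmoid gate computed from the concatenated features, which then blends the two element-wise. The function `gated_fusion`, the parameters `W_gate` and `b_gate`, and the assumption that both features share the same dimension are hypothetical and may differ from the paper's design.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_fusion(visual, textual, W_gate, b_gate):
    """Blend two same-sized feature vectors with a learned sigmoid gate.

    Assumed form: gate = sigmoid(W_gate @ [visual; textual] + b_gate),
    output = gate * visual + (1 - gate) * textual.
    """
    gate = sigmoid(W_gate @ np.concatenate([visual, textual]) + b_gate)
    return gate * visual + (1.0 - gate) * textual


# Toy features of dimension 4 (illustrative values only).
dim = 4
rng = np.random.default_rng(0)
visual = rng.standard_normal(dim)
textual = rng.standard_normal(dim)

# With zero gate parameters the gate is 0.5 everywhere,
# so the output is the plain average of the two features.
averaged = gated_fusion(visual, textual,
                        np.zeros((dim, 2 * dim)), np.zeros(dim))

# A large positive bias drives the gate towards 1, passing the visual feature.
visual_only = gated_fusion(visual, textual,
                           np.zeros((dim, 2 * dim)), np.full(dim, 50.0))
```

The fused vector could then serve as the multi-modal embedding that conditions the visual encoder, though how the paper wires this together is not stated in the captions above.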