HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding

Linhui Xiao, Xiaoshan Yang, Fang Peng, Yaowei Wang, Changsheng Xu

TL;DR

This work proposes a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG, which consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm.

Abstract

Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilize uni-modal pre-trained models to transfer visual or linguistic knowledge separately while ignoring multimodal correspondence information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. To address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm. The cross-modal bridge addresses the inconsistency between the visual features provided by pre-training and those required for grounding, and establishes a connection between multi-level visual and text features. HiLoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase its significant grounding capabilities as well as promising energy efficiency advantages. Project page: https://github.com/linhuixiao/HiVG.
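
For readers unfamiliar with the low-rank adaptation that HiLoRA builds on, the following is a minimal sketch of a vanilla LoRA forward pass only. It does not reproduce HiLoRA's shallow-to-deep staging, whose details are not given in this summary. The function names (`matvec`, `lora_forward`), the scaling factor `alpha / r`, and the toy dimensions are assumptions chosen for illustration, following the commonly used LoRA formulation in which a frozen weight `W` is augmented with a trainable low-rank product `B @ A`.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute (W + (alpha / r) * B @ A) @ x without forming the merged matrix.

    W: frozen pre-trained weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    r = len(A)                          # rank of the low-rank update
    base = matvec(W, x)                 # frozen path
    low_rank = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]


# Toy setup (illustrative sizes, not taken from the paper).
random.seed(0)
d_in, d_out, r = 4, 3, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero-initialized
x = [1.0, 2.0, 3.0, 4.0]

# With B initialized to zero, the adapted layer reproduces the frozen layer.
y_init = lora_forward(W, A, B, x)
```

Zero-initializing `B` means the adapted layer starts out identical to the pre-trained one, so training begins from the pre-trained behavior and only the small matrices `A` and `B` are updated.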

Paper Structure (22 sections, 15 equations, 8 figures, 11 tables)

Figures (8)

  • Figure 1: Visual attentions and grounding results of CLIP and the proposed HiVG. The attentions are perceived by the [CLS] token over vision tokens.
  • Figure 2: Schematic representation of the hierarchical multimodal fine-grained modulation framework.
  • Figure 3: HiLoRA and vanilla LoRA. (a) The vanilla LoRA learns the global low-rank matrix utilizing the entire set of pre-trained weights in a single round. (b) The proposed HiLoRA employs a hierarchical approach to adapt the pre-trained model in a progressive manner, thereby finely reducing the task gap between pre-training and transfer tasks.
  • Figure 4: Comparison between HiVG (base) and SOTA models, as well as the ablation study of HiVG on the main modules. (a) HiVG achieves significant energy efficiency advantages, running 8.2$\times$ faster than TransVG++ while outperforming it on RefCOCO-val. (b) The computational complexity of HiVG is only 13.0$\%$ of that of TransVG++. (c) HiVG outperforms SOTA models across different expression lengths on RefCOCOg-test. (d) The HiLoRA method brings significant performance gains to the HiVG model.
  • Figure 5: Qualitative results of our HiVG and CLIP-VG models on the RefCOCOg-val dataset. We present the predicted box with its IoU (in cyan) and the ground truth box (in green) in a single image to visually display the grounding accuracy.
  • ...and 3 more figures
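
The Figure 5 caption reports an IoU for each predicted box. As a reminder of what that number measures, here is a minimal sketch of intersection-over-union for two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function name `box_iou` are assumptions for illustration and may differ from the conventions used in the HiVG code.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
iou = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

An IoU of 1.0 means the predicted box coincides with the ground truth, and 0.0 means the two boxes do not overlap at all.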