Visual Grounding with Attention-Driven Constraint Balancing

Weitai Kang, Luowei Zhou, Junyi Wu, Changchang Sun, Yan Yan

TL;DR

This work tackles visual grounding with transformer-based fusion by supervising attention behavior rather than only the bounding-box regression output. It introduces AttBalance, a framework that combines the Rho-modulated Attention Constraint (RAC) and the Momentum Rectification Constraint (MRC) to regulate language-modulated attention, plus a DAT scheme that adaptively scales losses to mitigate the data imbalance this regularization introduces. Across four benchmarks and five models, AttBalance yields consistent improvements; integrated into QRNet, it sets a new state of the art, with notable gains on harder datasets. The approach is also data-efficient in semi-supervised settings and yields qualitatively better attention localization. Overall, AttBalance is a practical, plug-in method for aligning multi-modal attention with language guidance in visual grounding.

Abstract

Unlike object detection, the visual grounding task requires detecting an object described by complex free-form language. To jointly model such complex semantic and visual representations, recent state-of-the-art studies adopt transformer-based models to fuse features from both modalities and introduce various modules that modulate visual features to align with the language expression and suppress irrelevant, redundant information. However, their loss functions still follow common object detection losses and govern only the bounding box regression output, so they fail to fully optimize for these objectives. To tackle this problem, we first analyze the attention mechanisms of transformer-based models. Building on this analysis, we propose a novel framework, Attention-Driven Constraint Balancing (AttBalance), to optimize the behavior of visual features within language-relevant regions. Extensive experiments show that our method brings consistent improvements across five different models evaluated on four different benchmarks. Moreover, we attain new state-of-the-art performance by integrating our method into QRNet.

Paper Structure

This paper contains 18 sections, 5 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: The Y-axis shows Spearman's rank correlation between model performance (IoU) for TransVG and VLTVG and the sum of attention values inside the ground truth bounding box, computed over the majority of the evaluation dataset. The X-axis indicates the layer from which the attention is taken. Lines of different colors represent different datasets.
  • Figure 2: AttBalance applied to the transformer-based pipeline. The Early Interaction module exists only in some transformer-based models, e.g., QRNet and VLTVG. The Rho-modulated Attention Constraint (RAC) and the Momentum Rectification Constraint (MRC) together form our Attention Regularization, which regulates the attention behavior: the RAC uses a segmentation mask derived from the bounding box to supervise the attention map so that it focuses on the language-related region, while the MRC uses the attention map from a momentum version of the model to rectify the RAC. The Difficulty Weight adaptively scales up the losses to mitigate the data imbalance problem introduced by the Attention Regularization.
  • Figure 3: Imbalance study of attention value within ground truth region. Left: A histogram analysis of the attention values within the ground truth region of VLTVG's last layer. Right: The averaged attention values within the ground truth region of VLTVG's last layer under varying box ratios.
  • Figure 4: Attention maps of each layer for the expression "a fluffy black cat sniffing around a bathroom sink". The white box is the ground truth and the red box is the prediction.
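The quantity plotted in Figure 1 can be illustrated with a minimal sketch: turn a bounding box into a binary mask, sum the attention values inside it, and compute Spearman's rank correlation between those sums and per-sample IoU. Everything below is an assumption made for illustration and is not the authors' code: the function names, the box format (x0, y0, x1, y1) in attention-grid cells with exclusive upper bounds, and the toy numbers are all hypothetical.

```python
import numpy as np


def box_to_mask(box, height, width):
    """Binary mask (height x width) that is 1 inside box = (x0, y0, x1, y1).

    Assumption: coordinates are integer grid cells, upper bounds exclusive.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float64)
    mask[y0:y1, x0:x1] = 1.0
    return mask


def attention_in_box(attn, box):
    """Sum of the attention values that fall inside the box."""
    mask = box_to_mask(box, *attn.shape)
    return float((attn * mask).sum())


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=np.float64)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.

    Returns nan if either input is constant (not handled in this sketch).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    return float(np.corrcoef(rx, ry)[0, 1])


if __name__ == "__main__":
    # Toy inputs (hypothetical, not results from the paper).
    rng = np.random.default_rng(0)
    box = (1, 1, 3, 3)
    attn_maps = [rng.random((4, 4)) for _ in range(5)]
    attn_maps = [a / a.sum() for a in attn_maps]  # each map sums to 1
    in_box = [attention_in_box(a, box) for a in attn_maps]
    ious = rng.random(5)  # placeholder per-sample IoU values
    print(spearman(in_box, ious))
```

In a real analysis the attention maps would come from a chosen layer of the model and the IoU values from its predictions; the sketch only shows how the two per-sample quantities are paired before ranking.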