Adaptive Masking Enhances Visual Grounding

Sen Jia, Lei Li

TL;DR

Experimental results consistently show that IMAGE outperforms baseline models, with better generalization and stronger performance in low-shot scenarios. This points to adaptive feature manipulation through attention mechanisms and Gaussian modeling as a promising alternative to continually scaling dataset sizes for zero-shot and few-shot learning.

Abstract

In recent years, zero-shot and few-shot learning in visual grounding have garnered considerable attention, largely due to the success of large-scale vision-language pre-training on expansive datasets such as LAION-5B and DataComp-1B. However, the continuous expansion of these datasets presents significant challenges, particularly with respect to data availability and computational overhead, thus creating a bottleneck in the advancement of low-shot learning capabilities. In this paper, we propose IMAGE, Interpretative MAsking with Gaussian radiation modEling, aimed at enhancing vocabulary grounding in low-shot learning scenarios without necessitating an increase in dataset size. Drawing inspiration from cognitive science and the recent success of masked autoencoders (MAE), our method leverages adaptive masking on salient regions of the feature maps generated by the vision backbone. This enables the model to learn robust, generalized representations through the reconstruction of occluded information, thereby facilitating effective attention to both local and global features. We evaluate the efficacy of our approach on benchmark datasets, including COCO and ODinW, demonstrating its superior performance in zero-shot and few-shot tasks. Experimental results consistently show that IMAGE outperforms baseline models, achieving enhanced generalization and improved performance in low-shot scenarios. These findings highlight the potential of adaptive feature manipulation through attention mechanisms and Gaussian modeling as a promising alternative to approaches that rely on the continual scaling of dataset sizes for the advancement of zero-shot and few-shot learning. Our code is publicly available at https://github.com/git-lenny/IMAGE.
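
As a rough illustration of what masking salient regions of a feature map with Gaussian weighting could look like, the sketch below hides the cells that score highest under a Gaussian centred on the saliency peak. It is not the paper's RF-GAM module: the function names `gaussian_keep_mask` and `apply_mask`, the parameters `sigma` and `mask_ratio`, and the scoring rule (saliency multiplied by a Gaussian of the distance to the peak) are all assumptions made for this example.

```python
import math


def gaussian_keep_mask(saliency, sigma=1.0, mask_ratio=0.25):
    """Return a 0/1 keep-mask that hides cells around the saliency peak.

    saliency:   2D list of non-negative floats (hypothetical per-cell importance,
                e.g. an attention prior over a feature map).
    sigma:      spread of the Gaussian centred on the peak (assumed hyperparameter).
    mask_ratio: fraction of cells to hide (assumed hyperparameter).
    """
    h, w = len(saliency), len(saliency[0])

    # The most salient cell becomes the centre of the Gaussian.
    peak_r, peak_c = max(
        ((r, c) for r in range(h) for c in range(w)),
        key=lambda rc: saliency[rc[0]][rc[1]],
    )

    # Score each cell: its own saliency, damped by distance from the peak.
    scored = []
    for r in range(h):
        for c in range(w):
            d2 = (r - peak_r) ** 2 + (c - peak_c) ** 2
            weight = math.exp(-d2 / (2.0 * sigma ** 2))
            scored.append((saliency[r][c] * weight, r, c))

    # Hide the top-scoring cells; ties are broken by (row, column) order.
    n_mask = int(round(mask_ratio * h * w))
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    keep = [[1] * w for _ in range(h)]
    for _, r, c in scored[:n_mask]:
        keep[r][c] = 0
    return keep


def apply_mask(features, keep):
    """Zero out the hidden cells of a 2D feature map."""
    return [[f * k for f, k in zip(f_row, k_row)]
            for f_row, k_row in zip(features, keep)]
```

In a training loop, the masked feature map would be passed on to the rest of the model, which then has to recover the hidden information from the surrounding context. Choosing a top-k rule keeps the masked fraction fixed per image; a stochastic rule that samples cells in proportion to their score would be an equally plausible variant.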

Paper Structure

This paper contains 31 sections, 2 theorems, 14 equations, 5 figures, and 3 tables.

Key Result

Lemma 1

Let $\hat{y}_{ij}$ be the predicted similarity between the masked image feature embedding and the corresponding text embedding in a batch. Let $y^*_{ij}$ be the optimal similarity that minimizes the IMAGE loss $L_{\text{IMAGE}}$. Then, with probability at least $1-\delta$, we have: where $\tau$ is the temperature hyperparameter, $\beta$ is the masking loss weight, and $N_{\text{batch}}$ is the ba

Figures (5)

  • Figure 1: Our IMAGE method is inspired by human perception; by masking key details of objects, we encourage the model to learn more robust representations.
  • Figure 2: Pipeline of the IMAGE model, consisting of two blocks: the attention prior generation module and the RF-GAM mask generation module.
  • Figure 3: Scaling laws of our IMAGE model. With more training epochs, IMAGE achieves higher grounding AP across all four datasets and three settings.
  • Figure 4: Comparison of IMAGE with other strategies at different few-shot ratios.
  • Figure 5: Results at different image occlusion ratios across various methods.

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Theorem 1: IMAGE Generalization Bound
  • proof