Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding

Zilin Du, Haoxin Li, Jianfei Yu, Boyang Li

TL;DR

This work tackles visual grounding in data-scarce settings by introducing POBF (Paint Outside the Box and Filter), a framework that synthesizes image–caption–bounding-box data by inpainting only the region outside the ground-truth box, so the box label stays aligned with the image content. A filtering scheme combines a hardness score and an overfitting score, balanced by a penalty term, to select the most effective synthetic samples, which are then mixed with the limited real data to train a student model. Across four benchmarks, POBF improves over the real-data-only baseline by an average of $5.83\%$ and outperforms leading baselines by $2.29\%$–$3.85\%$ in accuracy, and it remains robust across generative models, training data sizes, and model architectures. The method thereby reduces reliance on large amounts of labeled grounding data.

Abstract

Visual grounding aims to localize image regions based on a textual query. Given the difficulty of large-scale data curation, we investigate how to effectively learn visual grounding under data-scarce settings. To address the data scarcity, we propose a novel framework, POBF (Paint Outside the Box and Filter). POBF synthesizes images by inpainting outside the box, tackling a label misalignment issue encountered in previous works. Furthermore, POBF leverages a filtering scheme to select the most effective training data. This scheme combines a hardness score and an overfitting score, balanced by a penalty term. Extensive experiments across four benchmark datasets demonstrate that POBF consistently improves performance, achieving an average gain of 5.83% over the real-data-only method and outperforming leading baselines by 2.29%–3.85% in accuracy. Additionally, we validate the robustness and generalizability of POBF across various generative models, training data sizes, and model architectures.
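
To make the "inpainting outside the box" idea concrete, the following is a minimal sketch of how an inpainting mask could be built so that the ground-truth box is never repainted. The function names, the `(x1, y1, x2, y2)` pixel-coordinate box format, and the convention that a mask value of 1 means "repaint" are assumptions for illustration; the abstract does not specify them, and the actual generative model call is omitted.

```python
import numpy as np


def outside_box_mask(height, width, box):
    """Build a mask that is 1 where repainting is allowed and 0 inside the box.

    `box` is assumed to be (x1, y1, x2, y2) in pixels, with an exclusive
    bottom-right corner. This format is an illustrative assumption.
    """
    x1, y1, x2, y2 = box
    # Clip the box to the image so slicing never goes out of bounds.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    mask = np.ones((height, width), dtype=np.uint8)
    mask[y1:y2, x1:x2] = 0  # protect the ground-truth region
    return mask


def composite(original, generated, mask):
    """Keep original pixels inside the box and generated pixels outside it."""
    repaint = mask[..., None].astype(bool)  # broadcast over the channel axis
    return np.where(repaint, generated, original)
```

Because the pixels inside the box are copied from the real image, the original box annotation remains valid for the synthetic image, which is the alignment property the abstract describes.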

Paper Structure

This paper contains 15 sections, 4 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overview of our proposed framework POBF, which consists of four steps: data generation, teacher training, data filtering, and student training. The real data sample (image, caption, box) is denoted as $(I, T, B)$. $T'$ and $I'_j$ denote the generated caption and image, respectively. $\mathcal{G}_{T}$ and $\mathcal{G}_{I}$ are generative models. $\mathcal{T}$ and $\mathcal{S}$ refer to the teacher and student model, respectively.
  • Figure 2: Illustration of the inputs used to compute each score for a generated image. Here, $B$ and $\hat{B}$ denote the ground truth and predicted bounding box, respectively.
  • Figure 3: Scatter plot illustrating the relationship between the unnormalized hardness score $S_1$ and the overfitting score $S_2$ across different datasets. Each point represents a generated sample.
  • Figure 4: Qualitative examples illustrating the effectiveness of the two proposed scores. The green score indicates the synthetic image with the higher corresponding score.
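
The filtering step that these figures refer to can be sketched as a score-and-select routine. The sketch below is only an illustration under stated assumptions: the min-max normalization, the additive combination of $S_1$ and $S_2$, the subtraction of a per-sample penalty weighted by a hypothetical `lam`, and the top-k selection are placeholders chosen for clarity, not the formulas used in the paper, which are not given in this section.

```python
import numpy as np


def min_max(scores):
    """Rescale scores to [0, 1]; return zeros if all scores are equal."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


def select_top_k(s1, s2, penalty, k, lam=1.0):
    """Return indices of the k samples with the highest combined score.

    s1: unnormalized hardness scores, one per synthetic sample.
    s2: unnormalized overfitting scores, one per synthetic sample.
    penalty: per-sample penalty values (hypothetical form).
    lam: hypothetical weight on the penalty term.
    """
    combined = min_max(s1) + min_max(s2) - lam * np.asarray(penalty, dtype=float)
    # Stable sort on the negated score gives a deterministic descending order.
    order = np.argsort(-combined, kind="stable")
    return order[:k]
```

Normalizing before combining keeps either score from dominating purely because of its scale, which is one plausible reason Figure 3 distinguishes the unnormalized values.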