Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
Zilin Du, Haoxin Li, Jianfei Yu, Boyang Li
TL;DR
This work tackles visual grounding in data-scarce settings by introducing POBF (Paint Outside the Box and Filter), a framework that synthesizes image–caption–bounding-box training data by inpainting only outside the ground-truth box, so that the box content stays aligned with its caption and label. A filtering scheme combines a hardness score and an overfitting score, balanced by a penalty term, to select the most effective synthetic samples, which are then mixed with limited real data to train a student model. Across four benchmarks, POBF improves over real-data-only training by an average of $+5.83\%$ and outperforms strong baselines by $2.29\%$–$3.85\%$ in accuracy, and it remains robust across different generative models, training data sizes, and model architectures. The method reduces reliance on densely labeled data for visual grounding.
Abstract
Visual grounding aims to localize image regions based on a textual query. Given the difficulty of large-scale data curation, we investigate how to learn visual grounding effectively under data-scarce settings. To address data scarcity, we propose a novel framework, POBF (Paint Outside the Box and Filter). POBF synthesizes images by inpainting outside the box, which avoids the label misalignment issue encountered in previous works. Furthermore, POBF uses a filtering scheme to select the most effective training data. This scheme combines a hardness score and an overfitting score, balanced by a penalty term. Extensive experiments across four benchmark datasets demonstrate that POBF consistently improves performance, achieving an average gain of 5.83\% over the real-data-only method and outperforming leading baselines by 2.29\%-3.85\% in accuracy. Additionally, we validate the robustness and generalizability of POBF across various generative models, training data sizes, and model architectures.
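To make the "paint outside the box" idea concrete, the following is a minimal sketch of how an inpainting mask could be built so that only pixels outside the ground-truth box are regenerated. It is an illustration under stated assumptions, not the authors' implementation: the function names `outside_box_mask` and `composite`, the `(x0, y0, x1, y1)` box convention with exclusive upper bounds, and the use of NumPy arrays are all hypothetical choices. The generative model itself is omitted; `generated` stands in for whatever image an inpainting model would return.

```python
import numpy as np

def outside_box_mask(height, width, box):
    """Return a uint8 mask: 1 where inpainting is allowed, 0 inside the box.

    `box` is (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive
    (an assumed convention, not taken from the paper).
    """
    x0, y0, x1, y1 = box
    # Clip the box to the image so slicing stays in bounds.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    mask = np.ones((height, width), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 0  # protect the annotated region
    return mask

def composite(original, generated, mask):
    """Keep original pixels inside the box and generated pixels outside it.

    `original` and `generated` are (H, W, C) arrays; `mask` is (H, W).
    """
    return np.where(mask[..., None].astype(bool), generated, original)

# Toy example: a 4x6 image with a box covering columns 1..3 and rows 1..2.
mask = outside_box_mask(4, 6, (1, 1, 4, 3))
original = np.zeros((4, 6, 3), dtype=np.uint8)
generated = np.full((4, 6, 3), 255, dtype=np.uint8)  # placeholder for model output
result = composite(original, generated, mask)
```

Because the pixels inside the box are copied unchanged from the real image, the original bounding box and caption remain valid labels for the synthesized image.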
