CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding

Linhui Xiao; Xiaoshan Yang; Fang Peng; Ming Yan; Yaowei Wang; Changsheng Xu

TL;DR

This work proposes CLIP-VG, a method that performs self-paced curriculum adapting of CLIP with pseudo-language labels. It outperforms the current state-of-the-art unsupervised method by a significant margin on the RefCOCO/+/g datasets in both single-source and multi-source scenarios.

Abstract

Visual Grounding (VG) is a crucial topic in the field of vision and language, which involves locating a specific region described by expressions within an image. To reduce the reliance on manually labeled data, unsupervised visual grounding methods have been developed to locate regions using pseudo-labels. However, the performance of existing unsupervised methods is highly dependent on the quality of pseudo-labels, and these methods always encounter issues with limited diversity. In order to utilize vision and language pre-trained models to address the grounding problem, and to take reasonable advantage of pseudo-labels, we propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels. We propose a simple yet efficient end-to-end network architecture to realize the transfer of CLIP to visual grounding. Based on the CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which can progressively find more reliable pseudo-labels to learn an optimal model, thereby achieving a balance between reliability and diversity for the pseudo-language labels. Our method outperforms the current state-of-the-art unsupervised method by a significant margin on the RefCOCO/+/g datasets in both single-source and multi-source scenarios, with improvements ranging from 6.78% to 10.67% and 11.39% to 14.87%, respectively. The results even outperform existing weakly supervised visual grounding methods. Furthermore, our method is also competitive in the fully supervised setting. The code and models are available at https://github.com/linhuixiao/CLIP-VG.

Paper Structure (20 sections, 18 equations, 11 figures, 9 tables, 2 algorithms)

Introduction
Related Work
Visual Grounding
Vision-Language Pre-trained Models
Curriculum Learning
Method
Task Definition
Network Architecture
Reliability Measurement
Single-source Self-paced Adapting (SSA)
Multi-source Self-paced Adapting (MSA)
Experiments
Implementation Details
Comparison with State-of-the-Art Methods
Ablation Study
...and 5 more sections

Figures (11)

Figure 1: Main idea of our proposed CLIP-VG, which adapts CLIP with pseudo-language labels in a self-paced curriculum adapting paradigm to realize the transfer learning in visual grounding.
Figure 2: Our CLIP-VG model architecture (Section 3.2) serves as a vision-language grounding model to realize the self-paced curriculum adapting of CLIP.
Figure 3: Self-paced curriculum adapting of CLIP by exploiting pseudo-language labels to realize unsupervised visual grounding. (a) Examples of pseudo-language labels (the sources of the different pseudo-language labels are described in Section 4.1; best viewed zoomed in). (b) Single-source Self-paced Adapting (SSA) utilizes the vision-language grounding model (VLGM) to exploit the pseudo-template labels for reliability measurement and greedy sample selection, achieving a more stable adaptation of CLIP by finding reliable pseudo-labels. (c) Multi-source Self-paced Adapting (MSA) further proposes source-specific reliability (SR) and cross-source reliability (CR) based on SSA. It sequentially conducts pseudo-label source selection, reliability measurer selection, and greedy sample selection to achieve an optimal balance between reliability and diversity.
Figure 4: Samples from the validation splits of the RefCOCO/+/g datasets. The figure illustrates the characteristics of ground-truth query labels and grounding difficulty among the three datasets, with language entities highlighted in cyan.
Figure 5: The complete Source-specific Reliability (SR, shown in blue) and Cross-source Reliability (CR, shown in teal) histograms, which are formed by scoring the three sources of pseudo-language labels in the interval (0.0, 1.0] with different Measurers. $\mathcal{M}_1, \mathcal{M}_2, \mathcal{M}_3$ represent the Reliability Measurers learned from pseudo-template labels, pseudo-relation labels, and pseudo-caption labels, respectively. Different sources exhibit distinctive distributions due to the specific quality and language taxonomy of their pseudo-language labels (i.e., (a1)-(b2)-(c3)), and different Reliability Measurers have divergent discrimination abilities on the same pseudo-label source (i.e., (a1)-(b1)-(c1)).
...and 6 more figures
