RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner

Ying Zang; Chenglong Fu; Runlong Cao; Didi Zhu; Min Zhang; Wenjun Hu; Lanyun Zhu; Tianrun Chen

TL;DR

RESMatch tackles the data-efficiency challenge in referring expression segmentation (RES) by introducing the first semi-supervised learning (SSL) framework for the task. It extends the weak-to-strong consistency paradigm with three core adaptations: revised strong perturbation, correlated text augmentation, and model adaptive guidance that accounts for pseudo-label quality. Empirical results on RefCOCO, RefCOCO+, and RefCOCOg show substantial gains over supervised baselines at low labeling ratios; for example, with 10% of the labels RESMatch reaches 87% of fully supervised accuracy on RefCOCO testA. The work establishes a foundation for SSL in complex vision-language grounding tasks and points to future directions in robust multimodal SSL.

Abstract

Referring expression segmentation (RES), a task that involves localizing specific instance-level objects based on free-form linguistic descriptions, has emerged as a crucial frontier in human-AI interaction. It demands an intricate understanding of both visual and textual contexts and often requires extensive training data. This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation. Extensive validation on multiple RES datasets demonstrates that RESMatch significantly outperforms baseline approaches, establishing a new state-of-the-art. Although existing SSL techniques are effective in image segmentation, we find that they fall short in RES. To address challenges such as the comprehension of free-form linguistic descriptions and the variability in object attributes, RESMatch introduces a trifecta of adaptations: revised strong perturbation, text augmentation, and adjustments for pseudo-label quality and strong-weak supervision. This pioneering work lays the groundwork for future research in semi-supervised learning for referring expression segmentation.

Paper Structure (22 sections, 7 equations, 7 figures, 8 tables)

This paper contains 22 sections, 7 equations, 7 figures, 8 tables.

Introduction
Related work
Method
Task Definition
RESMatch Network
Overview
Revisiting the Image Augmentation
Text Augmentation (TA)
Model Adaptive Guidance (MAG)
Experiments
Datasets
Evaluation Metrics
Implementation Details
Experimental Results
Comparison with Fully Supervised Settings
...and 7 more sections

Figures (7)

Figure 1: This work proposes RESMatch, the first Semi-Supervised Learning (SSL) pipeline for referring expression segmentation. We find that existing SSL approaches for image segmentation with an added text encoder (FixMatch in this figure) cannot be directly applied to RES. They often misunderstand the regions defined by the text (e.g., misidentifying regions, or recognizing too much or too little of the target).
Figure 2: Illustration of our proposed RESMatch. In RESMatch, labeled data $\{(I^x,T^x),Y^x\}$ is used to train the RES model $F$ by minimizing the supervised loss $L_{sup}$. Unlabeled data $(I^u,T^u)$, weakly augmented by $A^w(\cdot)$, is first fed into the model $\hat{F}$ to obtain predictions $p^w$. The pseudo-label quality, denoted as $s$, is based on the predicted confidence map. Based on the quality score, image strong augmentation $\phi$ is applied to the unlabeled data. Meanwhile, the semantic relevance filtering $\psi$ is applied with the strong text augmentation, denoted as $A^s(\cdot)$, to obtain predictions $p^s$ from the model $F$. The unsupervised loss $L_{unsup}$ is computed as the cross-entropy between $p^w$ and $p^s$, weighted by the quality of pseudo-labels $s$.
Figure 3: Training curves of RESMatch and Supervised. RESMatch improves performance more consistently and effectively than Supervised throughout the whole training period.
Figure 4: Visualizations of RESMatch and Supervised. Sub-figure (a) shows that the predictions of RESMatch are much better than those of Supervised. Sub-figure (b) indicates that both TA and MAG in RESMatch clearly improve the quality of pseudo-labels.
Figure 5: Failure cases of RESMatch. RESMatch still fails on some hard examples, such as abstract expressions and ambiguous text or images.
...and 2 more figures
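
The quality-weighted unsupervised loss described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of mean per-pixel confidence as the quality score $s$, and the 0.5 threshold for turning $p^w$ into a hard pseudo-label are all assumptions made for illustration, since the summary above only states that $s$ is derived from the predicted confidence map.

```python
import math

def pseudo_label_quality(p_weak):
    """Hypothetical quality score s: mean per-pixel confidence max(p, 1 - p).

    Assumption: the source only says s is based on the confidence map.
    """
    return sum(max(p, 1.0 - p) for p in p_weak) / len(p_weak)

def weighted_consistency_loss(p_weak, p_strong, eps=1e-7):
    """Binary cross-entropy between pseudo-labels from the weak view and the
    strong-view predictions, scaled by the quality score s.

    Assumption: pseudo-labels are obtained by thresholding p_weak at 0.5.
    """
    s = pseudo_label_quality(p_weak)
    total = 0.0
    for pw, ps in zip(p_weak, p_strong):
        y = 1.0 if pw >= 0.5 else 0.0        # hard pseudo-label from the weak view
        ps = min(max(ps, eps), 1.0 - eps)    # clamp to keep log() finite
        total += -(y * math.log(ps) + (1.0 - y) * math.log(1.0 - ps))
    return s * total / len(p_weak)

# Toy example: flattened foreground probabilities for a 4-pixel mask.
p_weak = [0.9, 0.8, 0.1, 0.2]      # predictions on the weakly augmented input
p_strong = [0.7, 0.6, 0.3, 0.4]    # predictions on the strongly augmented input
s = pseudo_label_quality(p_weak)
loss = weighted_consistency_loss(p_weak, p_strong)
print(f"s={s:.3f} loss={loss:.4f}")
```

Under these assumptions, an uncertain weak-view prediction (probabilities near 0.5) lowers $s$ and so shrinks the contribution of that sample to $L_{unsup}$, which is the intuition behind weighting by pseudo-label quality.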
