RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner
Ying Zang, Chenglong Fu, Runlong Cao, Didi Zhu, Min Zhang, Wenjun Hu, Lanyun Zhu, Tianrun Chen
TL;DR
RESMatch tackles the data-efficiency challenge in referring expression segmentation (RES) by introducing the first semi-supervised learning (SSL) framework for the task. It extends the weak-to-strong consistency paradigm with three core adaptations: revised strong perturbation, correlated text augmentation, and model adaptive guidance that accounts for pseudo-label quality. Empirical results on RefCOCO, RefCOCO+, and RefCOCOg show substantial gains over fully supervised baselines at low labeling ratios, including reaching 87% of fully supervised accuracy with 10% labels on RefCOCO testA. The work establishes a foundation for SSL in complex vision-language grounding tasks and points to future directions in robust multimodal SSL.
Abstract
Referring expression segmentation (RES), a task that involves localizing specific instance-level objects based on free-form linguistic descriptions, has emerged as a crucial frontier in human-AI interaction. It demands an intricate understanding of both visual and textual contexts and often requires extensive training data. This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation. Extensive validation on multiple RES datasets demonstrates that RESMatch significantly outperforms baseline approaches, establishing a new state-of-the-art. Although existing SSL techniques are effective in image segmentation, we find that they fall short in RES. To address the challenges specific to RES, including the comprehension of free-form linguistic descriptions and the variability in object attributes, RESMatch introduces a trifecta of adaptations: revised strong perturbation, text augmentation, and adjustments for pseudo-label quality and strong-weak supervision. This pioneering work lays the groundwork for future research in semi-supervised learning for referring expression segmentation.
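To make the weak-to-strong consistency idea concrete, the following is a minimal sketch of a generic confidence-masked consistency loss of the kind used in FixMatch-style SSL. It is not the RESMatch implementation: the function name `weak_to_strong_loss`, the fixed `threshold` of 0.95, and the flat list of per-pixel foreground probabilities are all assumptions made for illustration, and the fixed threshold is only a simple stand-in for the pseudo-label quality adjustment described above.

```python
import math


def weak_to_strong_loss(weak_probs, strong_probs, threshold=0.95, eps=1e-7):
    """Masked binary cross-entropy on unlabeled pixels (illustrative only).

    weak_probs / strong_probs: per-pixel foreground probabilities predicted
    for the weakly and strongly perturbed views of the same image-text pair.
    Returns (loss, kept_fraction).
    """
    if len(weak_probs) != len(strong_probs):
        raise ValueError("both views must cover the same pixels")

    total = 0.0
    kept = 0
    for p_weak, p_strong in zip(weak_probs, strong_probs):
        # Confidence of the weak-view prediction for its most likely class.
        confidence = max(p_weak, 1.0 - p_weak)
        if confidence < threshold:
            continue  # low-quality pseudo-label: ignored (assumed rule)

        # Hard pseudo-label taken from the weak view.
        pseudo = 1.0 if p_weak >= 0.5 else 0.0

        # Clamp to avoid log(0), then accumulate binary cross-entropy
        # of the strong-view prediction against the pseudo-label.
        p = min(max(p_strong, eps), 1.0 - eps)
        total += -(pseudo * math.log(p) + (1.0 - pseudo) * math.log(1.0 - p))
        kept += 1

    if kept == 0:
        return 0.0, 0.0
    return total / kept, kept / len(weak_probs)


# Hypothetical probabilities for four pixels of one unlabeled sample.
loss, kept_fraction = weak_to_strong_loss(
    [0.99, 0.02, 0.60, 0.97],  # weak view
    [0.90, 0.10, 0.20, 0.40],  # strong view
)
```

In this toy call the third pixel is dropped because its weak-view confidence (0.60) falls below the threshold, so only three of the four pixels contribute to the loss. A real RES pipeline would apply the same idea to dense mask tensors and would condition both views on the referring expression.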
