SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation

Sayan Nag, Koustava Goswami, Srikrishna Karanam

TL;DR

This work tackles weakly-supervised Referring Expression Segmentation (RES) by reducing the annotation requirements for both masks and bounding boxes. SafaRi introduces an adaptive multi-modal sequence transformer with a cross-modal fusion module (X-FACt) that includes Attention Mask Consistency Regularization (AMCR), a γ-scheduled bootstrapping pipeline, and Mask Validity Filtering (MVF), which selects valid pseudo-masks using bounding boxes from SpARC, a zero-shot REC approach. The bootstrapping loop adds the pseudo-labels that pass MVF to the labeled data and retrains the model, progressively improving segmentation quality. Empirically, SafaRi achieves state-of-the-art results on RES benchmarks under weak supervision and shows strong zero-shot generalization to Referring Video Object Segmentation, outperforming Partial-RES and even fully-supervised SeqTR at the 30% annotation level (e.g., 59.31 versus 58.93 mIoU for SeqTR on RefCOCO+@testA).

Abstract

Referring Expression Segmentation (RES) aims to provide a segmentation mask of the target object in an image referred to by the text (i.e., referring expression). Existing methods require large-scale mask annotations. Moreover, such approaches do not generalize well to unseen/zero-shot scenarios. To address the aforementioned issues, we propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations. To the best of our knowledge, ours is the first approach that considers only a fraction of both mask and box annotations (shown in Figure 1 and Table 1) for training. To enable principled training of models in such low-annotation settings, improve image-text region-level alignment, and further enhance spatial localization of the target object in the image, we propose a Cross-modal Fusion with Attention Consistency (X-FACt) module. For automatic pseudo-labeling of unlabeled samples, we introduce a novel Mask Validity Filtering routine based on a spatially aware zero-shot proposal scoring approach. Extensive experiments show that with just 30% annotations, our model SafaRi achieves 59.31 and 48.26 mIoU on RefCOCO+@testA and RefCOCO+@testB, respectively, compared to 58.93 and 48.19 mIoU obtained by the fully-supervised SOTA method SeqTR. SafaRi also outperforms SeqTR by 11.7% (on RefCOCO+@testA) and 19.6% (on RefCOCO+@testB) in a fully-supervised setting and demonstrates strong generalization capabilities in unseen/zero-shot tasks.
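
The mIoU figures quoted above can be read as an average of per-sample mask overlap. The following is a minimal sketch, assuming that mIoU here denotes the mean of per-sample intersection-over-union values (a common convention in RES evaluation) and that masks are binary arrays. The function names `mask_iou` and `mean_iou` are illustrative and do not come from the paper's code.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumption: two empty masks are scored as 0.0 in this sketch.
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(preds, gts):
    """Mean of per-sample IoUs, reported as a percentage."""
    ious = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(ious) / len(ious)

# Two toy 2x2 samples: one half-overlapping prediction, one exact match.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
gts = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
score = mean_iou(preds, gts)
```

Under this reading, a gap such as 59.31 versus 58.93 corresponds to a difference of 0.38 percentage points in average overlap.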

Paper Structure

This paper contains 12 sections, 4 equations, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: SafaRi achieves state-of-the-art performance in both weakly- and fully-supervised RES tasks. Unlike Partial-RES, SafaRi is not pretrained on a fully-supervised REC task; nevertheless, with just 30% annotations, SafaRi achieves 59.31 mIoU, whereas Partial-RES and fully-supervised SeqTR obtain 58.16 and 58.93 mIoU, respectively (see the mIoU vs. Label-Rate plot). In the weak-supervision setting, the X-FACt module (with its cross-modal fusion and AMCR components) and the SpARC module help SafaRi demonstrate excellent grounding capabilities in challenging scenarios where Partial-RES fails (see the qualitative examples). Quantitative results are provided in Tables \ref{table:sotaonres}-\ref{table:davishmdb}.
  • Figure 2: Overview of the weakly-supervised bootstrapping setup. It includes an initial training stage, followed by inference and pseudo-labeling steps. Filtered pseudo-masks are added to the initial dataset, and the model is retrained iteratively.
  • Figure 3: Architectural components of SafaRi. (i) We introduce X-FACt, composed of normalized gated cross-attention based Fused Feature Extractors and Attention Mask Consistency Regularization (AMCR), for enhancing cross-modal synergy and spatial localization of target objects. The fused output is subsequently fed to the Sequence Transformer for prediction of contour points. (ii) We design a Mask Validity Filtering (MVF) strategy for choosing valid pseudo-masks using the SpARC module, which is a zero-shot REC approach with spatial reasoning capabilities.
  • Figure 4: Qualitative differences between cross-attention maps and predicted masks in the presence and absence of AMCR. Without AMCR, some regions outside the object boundary are attended to, which degrades the quality of the predicted masks.
  • Figure 5: Impact of $\gamma$-scheduling under different initial values ($\gamma_{0}$) and different values of the AMCR balancing factor ($\lambda$), evaluated on RefCOCO@val with 30% and 10% mask labels.
  • ...and 4 more figures
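
The bootstrapping setup described in the Figure 2 caption (initial training, inference and pseudo-labeling, filtering, retraining) can be sketched roughly as the loop below. This is an illustrative outline under stated assumptions, not the authors' implementation: `train`, `predict_mask`, and `is_valid` are hypothetical callables supplied by the caller, with `is_valid` standing in for Mask Validity Filtering, whose actual criterion is not given in this section. The fixed `rounds` limit and the early-exit conditions are also assumptions.

```python
def bootstrap(labeled, unlabeled, train, predict_mask, is_valid, rounds=3):
    """Iteratively grow the labeled set with filtered pseudo-masks.

    labeled:   list of (sample, mask) pairs
    unlabeled: list of samples without masks
    train, predict_mask, is_valid: hypothetical callables (assumptions)
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    model = train(labeled)                      # initial training stage
    for _ in range(rounds):
        if not pool:
            break                               # every sample is labeled
        kept, remaining = [], []
        for sample in pool:
            mask = predict_mask(model, sample)  # inference / pseudo-labeling
            if is_valid(sample, mask):          # stand-in for MVF
                kept.append((sample, mask))
            else:
                remaining.append(sample)
        if not kept:
            break                               # no new pseudo-masks passed
        labeled.extend(kept)                    # enlarge the labeled set
        pool = remaining
        model = train(labeled)                  # retrain on the enlarged set
    return model, labeled, pool
```

Keeping the three stages as injected callables makes the control flow easy to inspect in isolation; in a real system each would wrap the segmentation model, its inference routine, and the filtering step.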