AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding

Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, Nicu Sebe

TL;DR

AlignCAT tackles weakly supervised visual grounding by decomposing cross-modal alignment into category-focused coarse filtering and attribute-focused fine-grained matching. It uses two shared semantic spaces and a three-stage query selection pipeline, driven by adaptive phrase attention and contrastive learning, to progressively filter visual queries and improve localization and segmentation. The approach achieves state-of-the-art results among weakly supervised methods on RefCOCO, RefCOCO+, and RefCOCOg for both referring expression comprehension (REC) and referring expression segmentation (RES), with notable gains when category information is integrated during training. By applying linguistic cues from general to detailed levels, AlignCAT strengthens cross-modal alignment, reduces interference from non-target objects, and generalizes well to multi-object, complex-text scenarios.

Abstract

Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
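The coarse-to-fine filtering idea described above can be illustrated with a small toy sketch. This is not the authors' implementation: it is a simplified two-step illustration under stated assumptions. The function names (`cosine`, `top_k`, `coarse_to_fine_select`), the hand-made 3-dimensional embeddings, the fixed word weights standing in for adaptive phrase attention, and the `keep` parameter are all hypothetical and chosen only to show how a category filter followed by attribute matching can reject a category-inconsistent candidate.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(scores, candidates, k):
    """Return the k candidate indices with the highest scores."""
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]


def coarse_to_fine_select(queries, category_emb, word_embs, word_weights, keep=2):
    """Toy two-step selection (hypothetical, not the paper's exact pipeline).

    Step 1 (coarse): keep the `keep` queries most similar to the category embedding.
    Step 2 (fine): among survivors, pick the query with the highest
    weighted word-level similarity (weights stand in for phrase attention).
    """
    cat_scores = [cosine(q, category_emb) for q in queries]
    survivors = top_k(cat_scores, range(len(queries)), keep)

    total_weight = sum(word_weights)
    fine_scores = {
        i: sum(w * cosine(queries[i], e) for w, e in zip(word_weights, word_embs))
        / total_weight
        for i in survivors
    }
    best = max(survivors, key=lambda i: fine_scores[i])
    return survivors, best


# Hand-made embeddings: axis 0 = target category, axis 1 = another category,
# axis 2 = the described attribute. All values are illustrative assumptions.
queries = [
    [1.0, 0.0, 1.0],   # right category, right attribute
    [1.0, 0.0, -1.0],  # right category, wrong attribute
    [0.0, 1.0, 3.0],   # wrong category, strong attribute match
]
category_emb = [1.0, 0.0, 0.0]
word_embs = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # category word, attribute word
word_weights = [0.2, 0.8]                        # attribute word weighted higher

survivors, best = coarse_to_fine_select(queries, category_emb, word_embs, word_weights)
print(survivors, best)
```

In this toy setup the third query matches the attribute word most strongly, so attribute matching alone would prefer it. The coarse step removes it first because its category similarity is zero, and the fine step then separates the two remaining same-category candidates by attribute.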

Paper Structure

This paper contains 21 sections, 19 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison of QueryMatch and the proposed AlignCAT. (a) QueryMatch fails to deal with category-based and attribute-based ambiguity in annotations. (b) AlignCAT progressively leverages linguistic cues from coarse (right-top) to fine (right-bottom) to filter visual queries, achieving category and attribute consistency.
  • Figure 2: AlignCAT framework overview. AlignCAT filters visual queries by hierarchically leveraging linguistic cues. The coarse-grained alignment module utilizes category and global information to discard category-inconsistent candidates. The fine-grained alignment module employs adaptive phrase attention to select the attribute-consistent visual query.
  • Figure 3: Visualization of adaptive phrase attention.
  • Figure 4: Visualization comparison of different selection designs of AlignCAT in weakly supervised REC. The red and green boxes are GT and predicted grounding results, respectively.
  • Figure 5: Visualization comparison to TRIS and QueryMatch for the weakly supervised RES task. The GT and predicted segmentation results are marked in red.
  • ...and 3 more figures