CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation

Zhuoyan Luo, Yinghao Wu, Tianheng Cheng, Yong Liu, Yicheng Xiao, Hongfa Wang, Xiao-Ping Zhang, Yujiu Yang

TL;DR

A hierarchical decoding framework for GRES that outperforms state-of-the-art GRES methods by a remarkable margin. It adds counting ability by casting multiple-, single-, and non-target scenarios as count- and category-level supervision, which facilitates comprehensive object perception.

Abstract

The newly proposed Generalized Referring Expression Segmentation (GRES) extends the formulation of classic RES to cover complex multiple-target and non-target scenarios. Recent approaches address GRES by directly extending well-adopted RES frameworks with object-existence identification. However, these approaches tend to encode multi-granularity object information into a single representation, which makes it difficult to precisely represent objects of different granularity. Moreover, applying simple binary object-existence identification across all referent scenarios fails to distinguish their inherent differences, which introduces ambiguity in object understanding. To tackle these issues, we propose a Counting-Aware Hierarchical Decoding framework (CoHD) for GRES. By decoupling the intricate referring semantics into different granularities with a visual-linguistic hierarchy, and dynamically aggregating them with intra- and inter-selection, CoHD boosts multi-granularity comprehension through the reciprocal benefit of the hierarchical nature. Furthermore, we incorporate counting ability by embodying multiple/single/non-target scenarios into count- and category-level supervision, facilitating comprehensive object perception. Experimental results on the gRefCOCO, Ref-ZOM, R-RefCOCO, and RefCOCO benchmarks demonstrate the effectiveness and rationality of CoHD, which outperforms state-of-the-art GRES methods by a remarkable margin. Code is available at https://github.com/RobertLuo1/CoHD.

Paper Structure

This paper contains 47 sections, 11 equations, 8 figures, and 13 tables.

Figures (8)

  • Figure 1: Comparison of decoding paradigms. Previous GRES methods attempt to integrate multi-granularity context into one joint feature for mask and object-existence prediction (a), which exhibits vulnerability to the complexity of multiple/non-target scenarios (c). Instead of relying on the single representation, we decouple the referring correspondence into the semantic hierarchy of different granularity and perform dynamic selective aggregation for robust segmentation. Further, in contrast to simple binary classification, we empower CoHD with counting ability to promote object perception (b).
  • Figure 2: Overview of CoHD. The model takes an image and a referring expression as input. After the encoding process, the Hierarchical Semantic Decoder (HSD) decouples the referring semantics into different granularities to establish a visual-linguistic hierarchy and then performs dynamic aggregation with intra- and inter-selection for the final segmentation. Moreover, with well-aligned semantic contexts, we introduce the Adaptive Object Counting (AOC) module to promote comprehensive object perception through category-specific counting under multiple/single/non-target scenarios.
  • Figure 3: Illustration of the Hierarchical Semantic Decoder. The Semantic Decoding Module takes visual and linguistic features as input and enables bi-directional modality calibration to generate the corresponding fine-grained semantic map at each level, which forms a visual-linguistic hierarchy. Subsequently, we incorporate the referring semantics into the query level together with the original language information, which facilitates a more holistic object understanding. With the well-aligned semantic hierarchy, we perform dynamic aggregation with the intra- and inter-selection mechanism for mask decoding, to exploit the reciprocal benefits brought by the hierarchical nature.
  • Figure 4: Segmentation results. (a) and (b) are input images and segmentation results of CoHD with different referring expressions. The term Count denotes the output of the AOC module.
  • Figure 5: Visual comparison in the generalized setting. (a) and (b) are the segmentation results of ReLA and CoHD, respectively. The term Count denotes the output of the AOC module.
  • ...and 3 more figures
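
The count- and category-level supervision mentioned in the abstract and in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names `build_count_target` and `scenario_of`, the use of integer category ids, and the per-category count vector are all assumptions made here for illustration. The idea shown is only that one count vector can describe non-target, single-target, and multiple-target expressions, where a binary existence label could only separate the first case from the other two.

```python
from collections import Counter
from typing import List, Sequence


def build_count_target(referred_category_ids: Sequence[int],
                       num_categories: int) -> List[int]:
    """Build a per-category count vector for one referring expression.

    `referred_category_ids` holds one (hypothetical) category id per
    referred object; an empty sequence models a non-target expression.
    """
    for cid in referred_category_ids:
        if not 0 <= cid < num_categories:
            raise ValueError(f"category id {cid} out of range")
    counts = Counter(referred_category_ids)
    return [counts.get(c, 0) for c in range(num_categories)]


def scenario_of(count_target: Sequence[int]) -> str:
    """Map a count vector to the referent scenario it encodes."""
    total = sum(count_target)
    if total == 0:
        return "non-target"
    if total == 1:
        return "single-target"
    return "multiple-target"


if __name__ == "__main__":
    # Two objects of category 0 and one of category 2 are referred to.
    target = build_count_target([0, 0, 2], num_categories=4)
    print(target, scenario_of(target))
    # An expression that matches nothing in the image.
    empty = build_count_target([], num_categories=4)
    print(empty, scenario_of(empty))
```

Under these assumptions, a binary existence label would treat the single-target and multiple-target cases identically, while the count vector keeps them apart and also records which categories are involved.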