Dictionary-based Framework for Interpretable and Consistent Object Parsing

Tiezheng Zhang, Qihang Yu, Alan Yuille, Ju He

TL;DR

CoCal addresses the challenge of interpretable and consistent object parsing by introducing a dictionary-based mask transformer with hierarchical dictionaries that align parts with objects. It learns discriminative dictionary components via component-wise contrastive learning and enforces cross-level logical consistency through dedicated losses, complemented by a post-processing step that enforces part-to-object consistency. The framework replaces traditional object queries with a fixed dictionary tied to class labels, enabling parameter-free nearest-neighbor inference at test time and clearer interpretability. Empirically, CoCal achieves state-of-the-art performance on PartImageNet and Pascal-Part-108, while also improving object-level metrics, and it demonstrates strong generalizability across different segmentation backbones. Overall, CoCal provides a principled, interpretable approach to hierarchical object parsing that leverages semantic structure to boost both part-level accuracy and global segmentation quality.
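The nearest-neighbor inference mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the function name, the array shapes, the use of cosine similarity, and the toy values are all assumptions made for illustration. The source only states that pixel features are matched to dictionary components tied to class labels.

```python
import numpy as np

def nearest_component_labels(pixel_features, dictionary, component_class):
    """Label each pixel with the class of its most similar dictionary component.

    pixel_features:  (N, D) per-pixel embeddings.
    dictionary:      (K, D) dictionary components.
    component_class: (K,) class id that each component is tied to.
    """
    # Assumption: cosine similarity as the nearest-neighbor metric.
    f = pixel_features / np.linalg.norm(pixel_features, axis=1, keepdims=True)
    d = dictionary / np.linalg.norm(dictionary, axis=1, keepdims=True)
    similarity = f @ d.T                 # (N, K) pixel-to-component similarity
    nearest = similarity.argmax(axis=1)  # index of the closest component
    return component_class[nearest]      # look up the class of that component

# Toy example: three components, the first two tied to class 0, the last to class 1.
dictionary = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
component_class = np.array([0, 0, 1])
pixels = np.array([[2.0, 0.1], [0.1, 3.0], [1.0, 0.2]])
labels = nearest_component_labels(pixels, dictionary, component_class)
```

Because the class of a pixel is read directly from the component it lands on, no learned classifier is needed at test time in this sketch, which is one way to read the "parameter-free" claim.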

Abstract

In this work, we present CoCal, an interpretable and consistent object parsing framework built on a dictionary-based mask transformer. Designed around Contrastive Components and Logical Constraints, CoCal rethinks the cluster-based mask transformer architectures used in segmentation. Specifically, CoCal uses a set of dictionary components, each explicitly linked to a specific semantic class. To advance this concept, CoCal introduces a hierarchical formulation of dictionary components that aligns with the semantic hierarchy, achieved by combining within-level contrastive components with cross-level logical constraints. Concretely, CoCal employs a component-wise contrastive algorithm at each semantic level, which contrasts dictionary components of the same class against those of different classes. Furthermore, CoCal enforces logical consistency through a cross-level contrastive learning objective, which ensures that the dictionary component representing a particular part is closer to its corresponding object component than to the components of other objects. To further strengthen the modeling of logical relations, we implement a post-processing function based on the principle that a pixel assigned to a part should also be assigned to the corresponding object. With these innovations, CoCal establishes a new state of the art on both PartImageNet and Pascal-Part-108, outperforming previous methods by 2.08% and 0.70% in part mIoU, respectively. Moreover, CoCal improves object-level metrics on both benchmarks, showing that it not only refines parsing at the part level but also raises the overall quality of object segmentation.
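As a rough illustration of the cross-level constraint, the sketch below scores each part component against all object components and penalizes parts that are not closest to their own object. The InfoNCE-style form, the temperature value, the single component per class, and every name here are assumptions; the abstract does not give the exact loss.

```python
import numpy as np

def cross_level_loss(part_components, object_components, parent_of_part, temperature=0.1):
    """InfoNCE-style loss pulling each part component toward its parent object.

    part_components:   (P, D) one component per part class (simplifying assumption).
    object_components: (O, D) one component per object class (simplifying assumption).
    parent_of_part:    (P,) index of the object class each part belongs to.
    """
    p = part_components / np.linalg.norm(part_components, axis=1, keepdims=True)
    o = object_components / np.linalg.norm(object_components, axis=1, keepdims=True)
    logits = (p @ o.T) / temperature                      # (P, O) scaled similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-probability of the correct parent object, averaged over parts.
    return float(-log_prob[np.arange(len(p)), parent_of_part].mean())

# Toy example: two objects, two parts, each part lying near one object component.
objects = np.array([[1.0, 0.0], [0.0, 1.0]])
parts = np.array([[1.0, 0.1], [0.1, 1.0]])
loss_consistent = cross_level_loss(parts, objects, np.array([0, 1]))
loss_inconsistent = cross_level_loss(parts, objects, np.array([1, 0]))
```

In this toy setup the loss is small when each part sits next to its own object component and large when the part-to-object assignment is swapped, which is the behavior the constraint is meant to encourage.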

Paper Structure

This paper contains 16 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of the proposed component-wise contrastive objectives. CoCal establishes two discriminative dictionaries, one at the part level and one at the object level. Within the same semantic level, part/object components of the same class are pulled closer ($\rightarrow\leftarrow$), while those of different classes are pushed apart ($\leftarrow\rightarrow$) (i.e., contrastive components). Across semantic levels, part components are pulled closer to their corresponding object components and pushed away from the components of other objects (i.e., logical constraints).
  • Figure 2: Meta-architecture of the proposed CoCal. CoCal builds on top of an off-the-shelf clustering-based mask transformer, incorporating dictionary components that function as the cluster centers for each semantic class. Throughout training, the dictionary components in CoCal are updated via both mask-wise objectives from the transformer and contrastive objectives from the dictionary. During testing, CoCal adopts a straightforward inference approach: a nearest-neighbor search of the pixel features over the dictionary components.
  • Figure 3: Illustration of logical constraints at inference. In this picture, a reptile-head and a reptile-body are wrongly predicted as a snake-head and a snake-body, respectively. CoCal corrects the wrong prediction by computing the logical path probability, i.e., the product of the part-level probability and the object-level probability, and re-assigning the labels along the path, which produces the correct part prediction.
  • Figure 4: Qualitative comparison of CoCal and kMaX-DeepLab on PartImageNet. CoCal produces much more accurate object parsing results, with precise boundaries (e.g., row 1) and fewer missed detections (e.g., rows 2 and 3).
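The logical-path correction described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions: per-pixel part and object probabilities are given as arrays, each part has exactly one parent object, and background handling is ignored. The function name, the class indices, and the probability values are hypothetical.

```python
import numpy as np

def logical_path_reassign(part_probs, object_probs, parent_of_part):
    """Re-assign part and object labels using part-to-object path probabilities.

    part_probs:     (N, P) per-pixel probabilities over part classes.
    object_probs:   (N, O) per-pixel probabilities over object classes.
    parent_of_part: (P,) index of the object class each part belongs to.
    """
    # Path probability = part probability * probability of that part's object.
    path_probs = part_probs * object_probs[:, parent_of_part]   # (N, P)
    part_labels = path_probs.argmax(axis=1)
    # The object label follows the chosen path, so part and object always agree.
    object_labels = parent_of_part[part_labels]
    return part_labels, object_labels

# Hypothetical classes: parts 0 = reptile-head, 1 = snake-head;
# objects 0 = reptile, 1 = snake.
parent_of_part = np.array([0, 1])
part_probs = np.array([[0.45, 0.55],    # part level alone would pick snake-head
                       [0.10, 0.90]])
object_probs = np.array([[0.80, 0.20],  # object level strongly favors reptile
                         [0.30, 0.70]])
part_labels, object_labels = logical_path_reassign(part_probs, object_probs, parent_of_part)
print(part_labels.tolist(), object_labels.tolist())  # [0, 1] [0, 1]
```

For the first pixel the part-level prediction alone would be snake-head (0.55 vs. 0.45), but the path probabilities are 0.36 for reptile-head and 0.11 for snake-head, so the label is corrected to reptile-head. Deriving the object label from the selected part guarantees that a pixel assigned to a part is also assigned to the corresponding object.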