Dictionary-based Framework for Interpretable and Consistent Object Parsing
Tiezheng Zhang, Qihang Yu, Alan Yuille, Ju He
TL;DR
CoCal addresses the challenge of interpretable and consistent object parsing by introducing a dictionary-based mask transformer with hierarchical dictionaries that align parts with objects. It learns discriminative dictionary components via component-wise contrastive learning and enforces cross-level logical consistency through dedicated losses, complemented by a post-processing step that enforces part-to-object consistency. The framework replaces traditional object queries with a fixed dictionary tied to class labels, enabling parameter-free nearest-neighbor inference at test time and clearer interpretability. Empirically, CoCal achieves state-of-the-art performance on PartImageNet and Pascal-Part-108, while also improving object-level metrics, and it demonstrates strong generalizability across different segmentation backbones. Overall, CoCal provides a principled, interpretable approach to hierarchical object parsing that leverages semantic structure to boost both part-level accuracy and global segmentation quality.
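To make the idea of nearest-neighbor inference over a class-tied dictionary concrete, the following is a minimal sketch, not the authors' implementation. It assumes cosine similarity as the distance measure and a dictionary laid out as a mapping from class label to a list of component vectors. The names `cosine` and `nearest_class` and the toy values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def nearest_class(feature, dictionary):
    """Return the class label whose component is most similar to `feature`.

    `dictionary` maps a class label to a list of component vectors
    (assumed layout; several components per class are allowed).
    """
    best_label, best_sim = None, -math.inf
    for label, components in dictionary.items():
        for comp in components:
            sim = cosine(feature, comp)
            if sim > best_sim:
                best_label, best_sim = label, sim
    return best_label


# Toy 2-D dictionary with made-up values, for illustration only.
dictionary = {
    "head": [[1.0, 0.0], [0.9, 0.1]],
    "body": [[0.0, 1.0]],
}
pixel_features = [[0.8, 0.2], [0.1, 0.9]]
labels = [nearest_class(f, dictionary) for f in pixel_features]
```

Because each component is bound to a class label, the prediction for a pixel can be read directly from the component it matches, which is the source of the interpretability described above.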
Abstract
In this work, we present CoCal, an interpretable and consistent object parsing framework based on a dictionary-based mask transformer. Designed around Contrastive Components and Logical Constraints, CoCal rethinks existing cluster-based mask transformer architectures used in segmentation. Specifically, CoCal utilizes a set of dictionary components, each explicitly linked to a specific semantic class. To advance this concept, CoCal introduces a hierarchical formulation of dictionary components that aligns with the semantic hierarchy. This is achieved by integrating both within-level contrastive components and cross-level logical constraints. Concretely, CoCal employs a component-wise contrastive algorithm at each semantic level, which contrasts dictionary components of the same class against those of different classes. Furthermore, CoCal addresses logical constraints through a cross-level contrastive learning objective, which ensures that the dictionary component representing a particular part is closer to its corresponding object component than to the components of other objects. To further enhance logical relation modeling, we implement a post-processing function based on the principle that a pixel assigned to a part should also be assigned to its corresponding object. With these innovations, CoCal establishes new state-of-the-art performance on both PartImageNet and Pascal-Part-108, outperforming previous methods by 2.08% and 0.70% in part mIoU, respectively. Moreover, CoCal improves object-level metrics on both benchmarks, showing that it not only refines parsing at the part level but also raises the overall quality of object segmentation.
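The post-processing principle can be illustrated with a small sketch. This is one possible reading, not the authors' procedure: it assumes that when the part and object predictions for a pixel disagree, the object label is overwritten with the parent object of the predicted part. The mapping `PART_TO_OBJECT`, the label names, and the function `enforce_part_object_consistency` are hypothetical.

```python
# Hypothetical part-to-object hierarchy; real label sets come from the dataset.
PART_TO_OBJECT = {
    "dog_head": "dog",
    "dog_body": "dog",
    "car_wheel": "car",
}


def enforce_part_object_consistency(part_pred, object_pred, part_to_object):
    """Make per-pixel object labels agree with per-pixel part labels.

    For each pixel, if the predicted part has a known parent object, the
    object label is set to that parent (assumed conflict-resolution rule).
    Pixels whose part label has no parent, such as background, keep their
    original object prediction.
    """
    fixed = []
    for part, obj in zip(part_pred, object_pred):
        parent = part_to_object.get(part)
        fixed.append(parent if parent is not None else obj)
    return fixed


# Flattened toy predictions for four pixels; the second pixel is inconsistent.
part_pred = ["dog_head", "dog_body", "background", "car_wheel"]
object_pred = ["dog", "car", "background", "car"]
fixed = enforce_part_object_consistency(part_pred, object_pred, PART_TO_OBJECT)
```

Here the second pixel is predicted as `dog_body` at the part level but `car` at the object level, so under the assumed rule its object label becomes `dog`. The opposite direction, correcting the part label from the object label, would be an equally valid design under the stated principle.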
