From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation

Hanrong Shi; Lin Li; Jun Xiao; Yueting Zhuang; Long Chen

TL;DR

This work targets Panoptic Scene Graph Generation (PSG), where prior methods largely rely on bbox-based features and overlook shape cues. It introduces Curricular shApe-aware FEature (CAFE), a model-agnostic framework that injects mask and boundary shape-aware features and trains three predicate-specific classifiers in an easy-to-hard sequence with knowledge distillation. Through cognition-based predicate grouping, stage-wise feature fusion, and balanced sampling, CAFE achieves state-of-the-art robustness and strong zero-shot generalization on the PSG dataset, across PredCls and SGDet tasks. The results demonstrate that incorporating shape-aware representations of object contours and interactions significantly reduces semantic confusion and improves performance, while remaining computationally practical. This framework lays groundwork for broader applications, including potential extension to panoptic video scene graph generation.

Abstract

Panoptic Scene Graph Generation (PSG) aims to generate a comprehensive graph-structure representation based on panoptic segmentation masks. Despite remarkable progress in PSG, almost all existing methods neglect the importance of shape-aware features, which inherently focus on the contours and boundaries of objects. To bridge this gap, we propose a model-agnostic Curricular shApe-aware FEature (CAFE) learning strategy for PSG. Specifically, we incorporate shape-aware features (i.e., mask features and boundary features) into PSG, moving beyond reliance solely on bbox features. Furthermore, drawing inspiration from human cognition, we propose to integrate shape-aware features in an easy-to-hard manner. To achieve this, we categorize the predicates into three groups based on cognition learning difficulty and correspondingly divide the training process into three stages. Each stage utilizes a specialized relation classifier to distinguish specific groups of predicates. As the learning difficulty of predicates increases, these classifiers are equipped with features of ascending complexity. We also incorporate knowledge distillation to retain knowledge acquired in earlier stages. Due to its model-agnostic nature, CAFE can be seamlessly incorporated into any PSG model. Extensive experiments and ablations on two PSG tasks under both robust and zero-shot PSG have attested to the superiority and robustness of our proposed CAFE, which outperforms existing state-of-the-art methods by a large margin.
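The per-stage objective described above (a cross-entropy term on the current stage's classifier plus a distillation term that retains the earlier stage's knowledge) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the `kd_weight` and `temperature` parameters, and the assumption that the earlier classifier's predicates occupy the first entries of the current classifier's output are all hypothetical choices made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy of one sample against an integer class label."""
    return -math.log(softmax(logits)[target])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def stage_loss(student_logits, target, teacher_logits=None,
               kd_weight=1.0, temperature=1.0):
    """CE loss for the current stage, plus an optional KL distillation term.

    Assumption (hypothetical): the previous-stage classifier (teacher) covers
    the first len(teacher_logits) classes of the current classifier (student).
    """
    loss = cross_entropy(student_logits, target)
    if teacher_logits is not None:
        n = len(teacher_logits)
        teacher = softmax(teacher_logits, temperature)
        student = softmax(student_logits[:n], temperature)
        loss += kd_weight * kl_divergence(teacher, student)
    return loss

# Stage 1 has no teacher; a later stage distills from the earlier classifier.
print(stage_loss([2.0, 0.5], target=0))
print(stage_loss([2.0, 0.5, -1.0, 0.0], target=0, teacher_logits=[0.0, 3.0]))
```

In this sketch the distillation term vanishes when the current classifier reproduces the earlier classifier's distribution over the shared predicates, so it only penalizes drift away from what was learned in the easier stage.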

Paper Structure (25 sections, 20 equations, 13 figures, 14 tables, 1 algorithm)

Introduction
Related Work
Approach
Overview: Two-Stage PSG Approach
Shape-aware Feature Preparation
Shape-aware Features Extraction
Stage-wise Feature Fusion
Curricular Feature Training
Cognition-based Predicate Grouping
Classification Space Configuration
Predicate Sampling
Training Objectives and Inference
Experiments
Experimental Settings
Implementation Details
...and 10 more sections

Figures (13)

Figure 1: (a) Scene Graph Generation (SGG): It relies on the bounding box-based paradigm, which can lead to inaccurate object localization and limited background annotation. (b) Panoptic Scene Graph Generation (PSG): It presents a more comprehensive and cleaner scene representation, with more accurate localization of objects and including relationships with the background (e.g., fence and playingfield).
Figure 2: With the escalation of learning difficulty in predicates, there is a corresponding increase in the complexity of features necessary for accurately predicting pairwise relations between objects. These features include traditional bbox features and our proposed shape-aware features (i.e., mask and boundary features).
Figure 3: The pipeline of our proposed CAFE. (a) Shape-aware Feature Preparation: we generate three types of features, ranging in complexity from simple to complex. (b) Curricular Feature Learning: we divide the training process into three stages according to the predicate learning difficulty from easy to hard. Each stage utilizes the corresponding features and has its own relation classifier. The training objectives include cross-entropy (CE) loss and KL loss.
Figure 4: Different instantiations of feature fusion strategies in the Transformer-based CAFE.
Figure 5: A confusion matrix of our Motifs+CAFE model. The element $\mathcal{C}[r_i][r_j]$ is the number of samples labeled as predicate $r_i$ but predicted as $r_j$. For instance, $\mathcal{C}[12][13]$ is the number of instances where the GT label is "walking on" but the prediction is "running on".
...and 8 more figures
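
The confusion matrix defined in the Figure 5 caption can be computed by simple counting. The sketch below is only an illustration of that definition: the function name and the toy predicate indices are hypothetical and are not taken from the paper's data.

```python
def confusion_matrix(gt_labels, pred_labels, num_predicates):
    """Return C where C[i][j] counts samples labeled i but predicted as j."""
    matrix = [[0] * num_predicates for _ in range(num_predicates)]
    for gt, pred in zip(gt_labels, pred_labels):
        matrix[gt][pred] += 1
    return matrix

# Toy example with three hypothetical predicate ids (0, 1, 2).
gt = [0, 1, 1, 2, 2, 2]
pred = [0, 1, 2, 2, 2, 1]
C = confusion_matrix(gt, pred, num_predicates=3)
print(C[1][2])  # samples labeled 1 but predicted as 2
```

Off-diagonal entries such as `C[1][2]` are the ones that expose confusion between semantically similar predicates.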
