Canonical Space Representation for 4D Panoptic Segmentation of Articulated Objects

Manuel Gomes, Bogdan Raducanu, Miguel Oliveira

TL;DR

This work tackles 4D panoptic segmentation for articulated objects by introducing Artic4D, a synthetic but realistic benchmark with 4D sensor data and rich annotations. It then proposes CanonSeg4D, a segmentation framework that learns a canonical representation for each movable part, enabling articulation-invariant, temporally consistent part clustering via a PST-Transformer backbone, a semantic head, and a canonical module with offset-based losses. Extensive experiments on Artic4D show CanonSeg4D achieving superior $LSTQ$ scores, especially in highly articulated scenarios, outperforming state-of-the-art methods by leveraging temporal context and canonical alignment. The results demonstrate the strength of temporal modeling and canonical-space representations for dynamic object understanding, with implications for robotic manipulation and real-world perception pipelines.

Abstract

Articulated object perception presents significant challenges in computer vision, particularly because most existing methods ignore temporal dynamics despite the inherently dynamic nature of such objects. The use of 4D temporal data has not been thoroughly explored in articulated object perception and remains unexamined for panoptic segmentation. The lack of a benchmark dataset further hinders this field. To this end, we introduce Artic4D, a new dataset derived from PartNet Mobility and augmented with synthetic sensor data, featuring 4D panoptic annotations and articulation parameters. Building on this dataset, we propose CanonSeg4D, a novel 4D panoptic segmentation framework. This approach explicitly estimates per-frame offsets mapping observed object parts to a learned canonical space, thereby enhancing part-level segmentation. The framework employs this canonical representation to achieve consistent alignment of object parts across sequential frames. Comprehensive experiments on Artic4D demonstrate that the proposed CanonSeg4D outperforms state-of-the-art approaches in panoptic segmentation accuracy in more complex scenarios. These findings highlight the effectiveness of temporal modeling and canonical alignment in dynamic object understanding, and pave the way for future advances in 4D articulated object perception.
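
The idea of mapping observed part points to a canonical space can be illustrated with a toy sketch. It is not the CanonSeg4D implementation: the door geometry, the hinge location, the joint angles, and the helper `rotate_about_z` are all hypothetical, and the offsets are computed from the known joint angle rather than predicted by a network. The sketch only shows why a per-frame part centroid drifts with articulation while a canonical-space centroid does not.

```python
import numpy as np


def rotate_about_z(points, angle, pivot):
    """Rotate (N, 3) points by `angle` radians about a vertical axis through `pivot`."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return (points - pivot) @ rot.T + pivot


rng = np.random.default_rng(0)

# Hypothetical door panel in its closed pose, used here as the canonical pose.
# The hinge runs along the z-axis through the origin (an assumption).
door_canonical = rng.uniform([0.0, -0.01, 0.0], [0.5, 0.01, 1.0], size=(200, 3))
hinge = np.zeros(3)

# Three frames of a sequence: closed, half open, fully open.
angles = np.deg2rad([0.0, 45.0, 90.0])
frames = [rotate_about_z(door_canonical, a, hinge) for a in angles]

# Reference used by centroid-based grouping: the part centroid in each frame.
frame_centroids = np.stack([f.mean(axis=0) for f in frames])

# Idealised per-point offsets from the observed pose to the canonical pose.
# In a learned model these would be regressed; here they are ground truth.
offsets = [door_canonical - f for f in frames]
canonical_centroids = np.stack(
    [(f + o).mean(axis=0) for f, o in zip(frames, offsets)]
)

# Largest per-axis spread of the reference point across the sequence.
centroid_drift = (frame_centroids.max(axis=0) - frame_centroids.min(axis=0)).max()
canonical_drift = (
    canonical_centroids.max(axis=0) - canonical_centroids.min(axis=0)
).max()

print(f"per-frame centroid drift: {centroid_drift:.3f}")
print(f"canonical centroid drift: {canonical_drift:.3e}")
```

In this toy setting the per-frame centroid moves by a sizeable fraction of the door width as the door opens, whereas the canonical-space centroid stays fixed, which is the property a clustering step can exploit to group points of the same part across frames.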

Paper Structure

This paper contains 24 sections, 17 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: 4D panoptic segmentation for articulated objects. Given a 3D model of an articulated object (a), a 4D point cloud sequence is generated (b) by capturing the object in different articulation states. State-of-the-art methods (c) transform each point (red arrows) to the centroid of its 4D space-time part (red circle). This centroid shifts with articulation: the centroid of the closed door is closer to the cabinet body than that of the open door, so the reference points are inconsistent. The proposed method (d) transforms each point into a learned canonical space (green part), yielding consistent representations (green arrows) across articulation states.
  • Figure 2: Artic4D dataset generation pipeline. Joint trajectories defined by power-law and sigmoid profiles (Eqs. (\ref{eq:traj-pow}) and (\ref{eq:traj-sig})) are uniformly sampled at 100 articulation states (left panel). For each state, a set of RGB-D viewpoints uniformly distributed on a viewing sphere captures the object (middle panel). Depth maps from all viewpoints are fused into a single point cloud and downsampled via farthest point sampling (right panel).
  • Figure 3: Overview of CanonSeg4D, the proposed 4D panoptic segmentation architecture. The input is a 4D point cloud, which is processed by a feature extraction backbone to capture both spatial and temporal features. From those features, a segmentation head predicts semantic labels and the canonical module predicts instance labels. Both outputs are combined to achieve panoptic segmentation. The canonical module, in red, learns to transform each point into a canonical space representation, creating well-defined part centroids. These centroids are then used to group points into instances, using a clustering algorithm.
  • Figure 4: Radar plots comparing the performance of the four methods (4D-StOP, Eq-4D-StOP, Mask4Former, and CanonSeg4D) across object categories in the Artic4D-M dataset: semantic segmentation performance ($S_{cls}$), instance association quality ($S_{assoc}$), and combined panoptic segmentation metric ($LSTQ$). Higher values indicate better performance.
  • Figure 5: Impact of input sequence length on CanonSeg4D's performance. The plot shows the $LSTQ$ score as a function of the number of frames per sequence, evaluated on the three subsets of the Artic4D dataset. Performance peaks at a sequence length of three frames across all subsets.
  • ...and 1 more figures
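
The farthest point sampling step named in the Figure 2 caption can be sketched as follows. This is a generic NumPy version of the standard greedy algorithm, not the authors' pipeline code: the function name `farthest_point_sampling`, the `start_index` parameter, and the point counts in the usage lines are assumptions made for illustration.

```python
import numpy as np


def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedily pick `num_samples` indices so that each new point is the one
    farthest from the points already selected.

    points: array of shape (N, D). Returns an int array of shape (num_samples,).
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    if not 0 < num_samples <= n:
        raise ValueError("num_samples must be in [1, N]")

    selected = np.empty(num_samples, dtype=np.int64)
    selected[0] = start_index

    # Squared distance from every point to its nearest selected point so far.
    min_sq_dist = np.full(n, np.inf)
    for i in range(1, num_samples):
        last = points[selected[i - 1]]
        sq_dist = np.sum((points - last) ** 2, axis=1)
        min_sq_dist = np.minimum(min_sq_dist, sq_dist)
        # The next sample is the point farthest from the current selection.
        selected[i] = np.argmax(min_sq_dist)
    return selected


# Usage: downsample a hypothetical fused point cloud to a fixed size.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(1000, 3))
idx = farthest_point_sampling(cloud, 64)
downsampled = cloud[idx]
print(downsampled.shape)
```

Compared with uniform random subsampling, this greedy selection tends to spread the retained points evenly over the surface, which is a common reason for using it when fusing depth maps from many viewpoints into one cloud of fixed size.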