IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals

Markus Gross, Aya Fahmy, Danit Niwattananan, Dominik Muhle, Rui Song, Daniel Cremers, Henri Meeß

TL;DR

IPFormer introduces context-adaptive instance proposals for vision-based 3D Panoptic Scene Completion, addressing limitations of static Transformer queries by initializing and refining proposals from image context at both train and test time. The method lifts 2D image features into a probabilistic 3D context, initializes instance and voxel proposals via visibility-aware sampling and deformable attention, and then performs a dual-stage encoding/decoding that first learns semantic completion and then panoptic completion. Through a two-stage, dual-head training objective and a principled instance-voxel alignment, IPFormer achieves state-of-the-art in-domain PSC performance, strong zero-shot generalization, and over 14x runtime reduction, demonstrating the viability of context-adaptive proposals for visual 3D scene understanding. The results indicate a significant advance for privacy-preserving, camera-based 3D perception in autonomous driving and robotics, with practical impact on real-time, holistic scene reconstruction.

Abstract

Semantic Scene Completion (SSC) has emerged as a pivotal approach for jointly learning scene geometry and semantics, enabling downstream applications such as navigation in mobile robotics. The recent generalization to Panoptic Scene Completion (PSC) advances the SSC domain by integrating instance-level information, thereby enhancing object-level sensitivity in scene understanding. While PSC was introduced using LiDAR modality, methods based on camera images remain largely unexplored. Moreover, recent Transformer-based approaches utilize a fixed set of learned queries to reconstruct objects within the scene volume. Although these queries are typically updated with image context during training, they remain static at test time, limiting their ability to dynamically adapt specifically to the observed scene. To overcome these limitations, we propose IPFormer, the first method that leverages context-adaptive instance proposals at train and test time to address vision-based 3D Panoptic Scene Completion. Specifically, IPFormer adaptively initializes these queries as panoptic instance proposals derived from image context and further refines them through attention-based encoding and decoding to reason about semantic instance-voxel relationships. Extensive experimental results show that our approach achieves state-of-the-art in-domain performance, exhibits superior zero-shot generalization on out-of-domain data, and achieves a runtime reduction exceeding 14x. These results highlight our introduction of context-adaptive instance proposals as a pioneering effort in addressing vision-based 3D Panoptic Scene Completion.

Paper Structure

This paper contains 26 sections, 12 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Comparison of query initialization for Panoptic Scene Completion (PSC). Existing methods, e.g., Symphonies [symphonies], randomly initialize instance queries and incorporate context-awareness during training. However, these queries remain static at test time, as they are shared across all inputs. Our method IPFormer initializes them as instance proposals, which preserve context-adaptivity at test time, effectively aggregating directed features for improved PSC performance. Due to the novelty of vision-based PSC and the absence of established baselines, we apply DBSCAN [ester1996_dbscan] clustering to Symphonies' SSC output to retrieve its individual instances.
  • Figure 2: Detailed architecture of IPFormer. Our method refines image features and a depth map to produce 3D context features, which are sampled based on visibility to generate context-adaptive instance and voxel proposals. In a two-stage training strategy, voxel proposals first handle Semantic Scene Completion, guiding the latent space toward detailed geometry and semantics. The second stage attends instance proposals over the pretrained voxel features to register individual instances. This dual-head design aligns semantics, instances and voxels, enabling robust Panoptic Scene Completion.
  • Figure 3: Instance-specific saliency. Through gradient-based attribution, we derive saliency maps that highlight in green the image regions from which an individual instance mainly retrieves context. Our introduced instance proposals effectively adapt to scene characteristics by guiding feature aggregation, substantially improving identification, classification, and completion. In contrast, instance queries sample context in an undirected manner, causing misclassification and geometric ambiguity.
  • Figure 4: Qualitative results on the SemanticKITTI [semantickitti] validation set. Each top row illustrates purely semantic information, following the SSC color map. Each bottom row displays individual instances, with randomly assigned colors to facilitate differentiation. Note that we specifically show instances of the Thing category for clarity.
  • Figure 5: Additional qualitative results on the SemanticKITTI [semantickitti] validation set. Each top row illustrates purely semantic information, following the SSC color map. Each bottom row displays individual instances, with randomly assigned colors to facilitate differentiation. Note that we specifically show instances of the Thing category for clarity.
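
The Figure 1 caption notes that instances for the Symphonies baseline are retrieved by applying DBSCAN clustering to its SSC output. The following is a minimal sketch of that idea, not the authors' implementation. The voxel layout (a dict from integer coordinates to class ids), the per-class clustering, the helper names `dbscan` and `instances_from_semantics`, and the `eps` and `min_samples` values are all assumptions chosen for illustration. DBSCAN is written out in plain Python here so the example has no dependencies.

```python
from collections import deque


def dbscan(points, eps, min_samples):
    """Minimal brute-force DBSCAN. Returns one label per point, -1 for noise."""
    n = len(points)
    eps_sq = eps * eps

    def neighbors(i):
        # Region query: all points within eps of point i (including i itself).
        pi = points[i]
        return [j for j in range(n)
                if sum((a - b) ** 2 for a, b in zip(pi, points[j])) <= eps_sq]

    labels = [None] * n  # None marks an unvisited point
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # only core points expand the cluster
        cluster += 1
    return labels


def instances_from_semantics(voxel_labels, thing_classes, eps=1.0, min_samples=2):
    """Cluster the voxels of each Thing class separately (an assumption).

    voxel_labels: dict mapping (x, y, z) -> semantic class id.
    Returns a dict mapping (x, y, z) -> (class id, instance id); noise is dropped.
    """
    result = {}
    next_id = 0
    for cls in sorted(thing_classes):
        coords = sorted(v for v, c in voxel_labels.items() if c == cls)
        if not coords:
            continue
        labels = dbscan(coords, eps, min_samples)
        for voxel, label in zip(coords, labels):
            if label >= 0:
                result[voxel] = (cls, next_id + label)
        next_id += max(labels) + 1  # keep instance ids unique across classes
    return result


# Hypothetical toy scene: class 0 = road (Stuff), class 1 = car (Thing).
scene = {
    (0, 0, 0): 1, (1, 0, 0): 1,   # first car
    (5, 0, 0): 1, (5, 1, 0): 1,   # second car
    (9, 9, 9): 1,                 # isolated car voxel, treated as noise
    (2, 0, 0): 0, (3, 0, 0): 0,   # road, not clustered
}
instances = instances_from_semantics(scene, thing_classes={1})
```

In this toy scene the two car blobs receive different instance ids, while the isolated voxel and the road voxels receive none. On real voxel grids a spatially indexed DBSCAN would be preferable to the brute-force neighbor search used here.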