Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction

Xuan Yu; Yuxuan Xie; Yili Liu; Haojian Lu; Rong Xiong; Yiyi Liao; Yue Wang

TL;DR

PanopticRecon++ introduces a fully end-to-end open-vocabulary panoptic reconstruction framework that casts 3D instances as learnable Gaussian-modulated queries within a cross-attention neural field. By fusing 3D spatial priors with segmentation cues and employing a parameter-free panoptic head, it achieves consistent semantic and instance segmentation across views while delivering high-quality geometry and novel-view synthesis. The method dynamically adjusts the number of instance tokens and leverages Hungarian assignment for cross-frame 2D–3D ID alignment, enabling robust open-world object understanding without 3D bounding boxes. Across Replica, ScanNet-V2, ScanNet++, and KITTI-360, PanopticRecon++ shows competitive or superior performance in 2D/3D segmentation and reconstruction, with practical implications for embodied robotics and photorealistic simulation.
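The ID alignment step mentioned above can be illustrated with a toy example. This is a minimal sketch rather than the authors' implementation: it assumes masks are given as sets of pixel indices, uses IoU as the matching score, and swaps the Hungarian algorithm for a brute-force search over permutations, which returns the same optimal assignment but only scales to a handful of instances. The names `iou` and `align_ids` are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def align_ids(masks_2d, masks_rendered):
    """Map each 2D instance ID to a 3D query index by maximizing total IoU.

    Brute force over permutations (a stand-in for a Hungarian solver).
    Assumes len(masks_2d) <= len(masks_rendered).
    """
    ids = list(masks_2d)
    best_score, best_map = -1.0, {}
    for perm in permutations(range(len(masks_rendered)), len(ids)):
        score = sum(iou(masks_2d[i], masks_rendered[q]) for i, q in zip(ids, perm))
        if score > best_score:
            best_score, best_map = score, dict(zip(ids, perm))
    return best_map


# Toy frame: the 2D segmenter produced arbitrary IDs 7 and 3;
# three instance queries were rendered into this view.
masks_2d = {7: {0, 1, 2, 3}, 3: {10, 11, 12}}
masks_rendered = [{10, 11, 12, 13}, {0, 1, 2}, {20, 21}]
mapping = align_ids(masks_2d, masks_rendered)
print(mapping)
```

In practice a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the permutation loop; the per-frame mapping then lets frame-local 2D IDs supervise globally consistent 3D instance queries.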

Abstract

Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ shows competitive 3D and 2D segmentation and reconstruction performance on both simulated and real-world datasets, and we demonstrate a use case as a robot simulator. Our project website is at: https://yuxuan1206.github.io/panopticrecon_pp/
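To make the cross-attention view concrete, the sketch below computes instance probabilities for a single 3D point. It is an illustration under stated assumptions, not the paper's formulation: each query is assumed to carry an embedding, a 3D mean, and an isotropic standard deviation, and the spatial prior is assumed to enter as an unnormalized log-Gaussian added to the query-key dot product before the softmax. The function name `instance_probs` and the dictionary layout are hypothetical.

```python
import math


def instance_probs(x, feat, queries):
    """Softmax over instance queries for one 3D point.

    x:       (x, y, z) position of the point
    feat:    key embedding of the point, sampled from the scene field
    queries: list of dicts with 'embed' (query vector), 'mean' (3D centre)
             and 'sigma' (isotropic std-dev, an assumption of this sketch)
    """
    logits = []
    for q in queries:
        similarity = sum(a * b for a, b in zip(q["embed"], feat))
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, q["mean"]))
        log_prior = -0.5 * sq_dist / q["sigma"] ** 2  # unnormalized log Gaussian
        logits.append(similarity + log_prior)
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Two queries with identical embeddings but different locations,
# e.g. two look-alike chairs that never appear in the same image.
queries = [
    {"embed": [1.0, 0.0], "mean": (0.0, 0.0, 0.0), "sigma": 0.5},
    {"embed": [1.0, 0.0], "mean": (3.0, 0.0, 0.0), "sigma": 0.5},
]
probs = instance_probs((0.2, 0.0, 0.0), [1.0, 0.0], queries)
print(probs)
```

Because both queries have the same embedding, feature similarity alone cannot separate them; the Gaussian term assigns the point to the nearby query. Every operation is differentiable, so the means and scales could be optimized jointly with the embeddings.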

Paper Structure (35 sections, 40 equations, 15 figures, 8 tables)

Introduction
Related Works
Close-vocabulary Segmentation Reconstruction
Open-vocabulary Segmentation
Open-vocabulary Segmentation Reconstruction
A Cross-attention Perspective
Analysis of 3D Fields in Mask-lifting
System Overview
Instance Branch Design
3D Gaussian-modulated Query Token
Cross-Attention Design
Dynamic Token Adjustment
End-to-end Panoptic Reconstruction
Neural Image Synthesis
Panoptic Image Synthesis
...and 20 more sections

Figures (15)

Figure 1: End-to-end open-vocabulary panoptic reconstruction with a 2D foundation model faces three challenges: 1) Misalignment: 2D instance IDs are not aligned across frames. 2) Ambiguity: Due to the limited FoV, two objects that never co-occur in a single image can be the same or different instances. 3) Inconsistency: The semantic and instance segmentations obtained from two separate heads are inconsistent. We align 2D instance IDs through linear assignment of instance tokens, eliminate the ambiguity of 3D instances by incorporating a spatial prior, and output consistent semantic and instance masks with a parameter-free panoptic head, generating a geometric mesh with panoptic masks that allows for multi-branch novel-view synthesis.
Figure 2: The inputs to PanopticRecon++ are posed RGB-D images and segmentation images generated by Grounded SAM [ren2024grounded]. The field representation comprises appearance, SDF, semantics, and instances. Appearance leverages 3DGS [kerbl20233dgs], and we use three hierarchical hashed encoding models [muller2022instant] for SDF, semantics, and instances. The radiance field is supervised by RGB loss and depth loss. The geometry field is supervised by depth loss and SDF loss. The stuff-class probabilities output by the semantic field and the instance probabilities computed from the instance field and instance tokens through cross-attention are concatenated under Bayes' rule to form the panoptic (Pan.) probability. The segmentation images generated by Grounded SAM directly supervise the semantic, instance, and panoptic probabilities. Finally, PanopticRecon++ outputs a high-quality panoptic mesh, point cloud, and multi-branch novel-view synthesis.
Figure 3: Instance branch design by cross-attention between 3D Gaussian-modulated query token and spatial hashing encoded scene fields.
Figure 4: A visualization of the contribution of various attributes of instance tokens to segmentation. Instance classes are visualized by semantic label colors on the mean points of instance tokens. The intensity of the spatial prior is visualized using a color gradient, with red indicating the highest intensity, decreasing outwards.
Figure 5: The architecture of the panoptic segmentation head in both training and inference stages. During training, $L_{sem}$ and $L_{ins}$ maintain the semantic branch and instance branch, respectively. The semantic classes of the instance tokens are supervised by $L_{C}$. The parameter-free panoptic head, derived from the fusion of the semantic and instance branches, is trained using $L_{pan}$, enabling direct prediction of panoptic probability. During inference, semantic and instance segmentation results are directly derived from the panoptic segmentation output of the parameter-free panoptic head.
...and 10 more figures
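
The fusion described in the captions of Figures 2 and 5 can be sketched for a single point as follows. This is one plausible reading under explicit assumptions, not the paper's exact head: the probability of being a thing is taken as the semantic mass not assigned to stuff classes, and each instance entry is the product p(thing) * p(instance k | thing). The function name `panoptic_probs` and the class names are hypothetical.

```python
def panoptic_probs(sem_probs, stuff_classes, inst_probs):
    """Fuse per-point semantic and instance distributions without extra parameters.

    sem_probs:     dict mapping class name -> probability (sums to 1)
    stuff_classes: class names treated as stuff; all others are things
    inst_probs:    distribution over instance queries, given the point is a thing
    """
    stuff = {c: p for c, p in sem_probs.items() if c in stuff_classes}
    p_thing = 1.0 - sum(stuff.values())
    fused = dict(stuff)
    for k, p in enumerate(inst_probs):
        fused[("instance", k)] = p_thing * p  # p(thing) * p(instance k | thing)
    return fused


sem = {"wall": 0.1, "floor": 0.1, "chair": 0.8}
pan = panoptic_probs(sem, {"wall", "floor"}, [0.75, 0.25])
label = max(pan, key=pan.get)  # a stuff class name or ("instance", k)
print(label)
```

The fused vector is itself a valid distribution, so a single arg-max yields a panoptic label from which the semantic and instance outputs can both be read off, which keeps the two consistent by construction.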
