VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection

Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu

TL;DR

VGGT-Det is the first framework tailored for SG-Free multi-view indoor 3D object detection. It integrates the VGGT encoder into a transformer-based pipeline and introduces two key components: Attention-Guided Query Generation and Query-Driven Feature Aggregation.

Abstract

Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, which limits deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth) are available. The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG), which exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; and (ii) Query-Driven Feature Aggregation (QD), in which a learnable See-Query interacts with object queries to 'see' what they need and then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. An ablation study shows that VGGT's internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.
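
The Query-Driven Feature Aggregation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `see_query_aggregate`, the per-layer `layer_keys`, the single-head attention without learned projections, and the softmax weighting over layers are all assumptions made for illustration. It only shows the general pattern of a See-Query first attending jointly with the object queries and then producing weights that blend features from several encoder layers.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def see_query_aggregate(see_query, object_queries, layer_keys, layer_feats):
    """Blend per-layer features with weights derived from a See-Query.

    see_query:      vector of length d (hypothetical learnable token)
    object_queries: list of vectors of length d
    layer_keys:     one vector of length d per encoder layer (assumed)
    layer_feats:    per layer, a list of token feature vectors
    Returns (layer_weights, fused_feats).
    """
    d = len(see_query)
    # 1) Self-attention step: the See-Query attends over itself and the
    #    object queries (single head, no learned projections, for brevity).
    tokens = [see_query] + object_queries
    attn = softmax([dot(see_query, t) / math.sqrt(d) for t in tokens])
    seen = [sum(a * t[j] for a, t in zip(attn, tokens)) for j in range(d)]
    # 2) Score each encoder layer against the updated See-Query.
    layer_w = softmax([dot(seen, k) / math.sqrt(d) for k in layer_keys])
    # 3) Weighted sum of the per-layer token features.
    n_tokens = len(layer_feats[0])
    n_channels = len(layer_feats[0][0])
    fused = [
        [sum(w * feats[t][c] for w, feats in zip(layer_w, layer_feats))
         for c in range(n_channels)]
        for t in range(n_tokens)
    ]
    return layer_w, fused
```

In this sketch the layer weights depend on the current object queries, so different query sets would blend shallow and deep features differently; a real model would use learned projections and multiple heads.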

Paper Structure (16 sections, 16 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: To achieve and improve Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, VGGT-Det effectively leverages the internal semantic and geometric priors from VGGT (Wang et al., 2025), rather than merely consuming its predictions. VGGT-Det significantly surpasses competitive methods in the SG-Free setting.
  • Figure 2: Overview of the proposed Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection framework, VGGT-Det. It is built upon the VGGT encoder (Wang et al., 2025), which extracts 3D-aware features from multi-view images. The decoder processes a set of object queries that cross-attend to the extracted features and iteratively updates the queries for final detection. To effectively leverage both the semantic and geometric priors from inside VGGT, we design two key components: Attention-Guided Query Generation (AG) and Query-Driven Feature Aggregation (QD). AG uses the semantic priors in the VGGT encoder's attention to generate object queries, enabling these queries to focus on object regions while preserving the global spatial structure. QD introduces a learnable See-Query, which interacts with object queries via self-attention to 'see' their needs and dynamically aggregates multi-level geometric features accordingly.
  • Figure 3: Computation flow of Attention-Guided Query Generation.
  • Figure 4: The proposed Attention-Guided Query Generation (AG) is motivated by an observation: attention maps from the VGGT encoder (Wang et al., 2025) correlate strongly with semantic content, even though VGGT is not explicitly trained for semantic tasks. In the left column, object regions tend to receive higher attention weights. In the middle column, AG samples from VGGT-predicted point clouds under the guidance of attention weights and point distribution information. In the right column, compared to farthest point sampling without guidance (red points), the points sampled by AG (green points) are more concentrated in object regions (labeled by green boxes), so those areas contain more green points than red points. For clarity, we recommend viewing the figure in color and zooming in.
  • Figure 5: Visualization of validation losses. In the left subfigure, after applying AG, the GIoU loss is significantly lower than that of the baseline backbone, indicating that AG effectively improves object localization during training. In the right subfigure, as See-Query progressively learns, within a few epochs, to interact with object queries and to aggregate encoded geometric features effectively, the loss for 'AG+QD' becomes significantly lower than that for 'AG', highlighting the effectiveness of the proposed QD strategy.
  • ...and 3 more figures
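
The Figure 4 caption contrasts plain farthest point sampling with sampling guided by attention weights. A minimal sketch of one way to combine the two signals is given below. It is an assumption for illustration, not the paper's formula: the function name `attention_guided_fps`, the multiplicative score (distance times attention raised to `alpha`), and the `alpha` parameter itself are hypothetical. With `alpha=0` the procedure reduces to ordinary farthest point sampling, which makes the effect of the attention prior easy to compare.

```python
import math


def attention_guided_fps(points, attn, k, alpha=1.0):
    """Pick k point indices, trading spatial coverage against attention.

    points: list of (x, y, z) tuples
    attn:   non-negative attention weight per point
    alpha:  hypothetical knob; 0 gives plain farthest point sampling,
            larger values favor high-attention points more strongly
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(points)")
    # Start from the highest-attention point (a deterministic choice).
    first = max(range(n), key=lambda i: attn[i])
    selected = [first]
    chosen = {first}
    # min_dist[i]: distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        candidates = [i for i in range(n) if i not in chosen]
        # Coverage term (distance) scaled by the attention prior.
        nxt = max(candidates, key=lambda i: min_dist[i] * attn[i] ** alpha)
        selected.append(nxt)
        chosen.add(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return selected
```

For example, given a low-attention cluster and a high-attention cluster, a larger `alpha` keeps most samples inside the high-attention cluster, while `alpha=0` spreads them across both, mirroring the green-versus-red contrast described in the caption.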