Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention
Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh
TL;DR
This work tackles multi-object 3D grounding from point clouds by introducing D-LISA, a two-stage framework that combines dynamic box proposals, a scene-aware dynamic multi-view renderer for 2D features, and a language-informed spatial attention mechanism (LISA) to reason about spatial relations. The dynamic vision module selects a variable number of proposals via a learned probability ${\alpha}_m$, while the renderer learns scene-conditioned camera poses to improve auxiliary features; LISA then fuses visual and language cues with a distance-based spatial matrix to produce per-box grounding scores. Training optimizes a composite loss over detection, grounding, contrastive, and dynamic-proposal terms, and final predictions are obtained by thresholding $p_n$ for each proposal. Empirically, D-LISA achieves a $12.8\%$ absolute improvement over the prior state-of-the-art on Multi3DRefer, while also delivering strong single-object grounding results on ScanRefer and Nr3D, demonstrating effectiveness for both multi-object grounding and practical robotics applications with modest computational overhead.
Abstract
Multi-object 3D grounding involves locating 3D boxes in a point cloud based on a given query phrase. It is a challenging and significant task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach incorporating three innovations: first, a dynamic vision module that enables a variable and learnable number of box proposals; second, a dynamic camera positioning scheme that extracts features for each proposal; and third, a language-informed spatial attention module that better reasons over the proposals to output the final prediction. Experiments show that our method outperforms state-of-the-art methods on multi-object 3D grounding by 12.8% (absolute) and is competitive in single-object 3D grounding.
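To make two of the ingredients named in the TL;DR concrete, the following is a minimal sketch in plain Python of (a) a distance-based spatial matrix over box centers and (b) the final step of thresholding a per-box score $p_n$, which is what lets the number of predicted boxes vary per query. This is an illustration under stated assumptions, not the authors' implementation. The function names `pairwise_distances` and `select_boxes`, the use of Euclidean distance between box centers, the threshold value `0.5`, and the toy inputs are all hypothetical. How the spatial matrix enters the attention computation is not specified in this section and is not modeled here.

```python
import math


def pairwise_distances(centers):
    """Return an N x N matrix of Euclidean distances between box centers.

    Assumption: the spatial matrix is built from center-to-center distances;
    the summary only says it is "distance-based".
    """
    return [[math.dist(a, b) for b in centers] for a in centers]


def select_boxes(scores, threshold=0.5):
    """Return indices of proposals whose grounding score exceeds the threshold.

    Assumption: threshold=0.5 is a placeholder value chosen for illustration.
    Zero, one, or many boxes may be returned, matching the multi-object setting.
    """
    return [n for n, p in enumerate(scores) if p > threshold]


# Toy example: three proposals with hypothetical centers and scores.
centers = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)]
scores = [0.9, 0.2, 0.7]

dist = pairwise_distances(centers)
selected = select_boxes(scores)
```

Because selection is a per-box threshold rather than an argmax, the same code path covers queries that refer to no object, a single object, or several objects.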
