Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention

Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh

TL;DR

This work tackles multi-object 3D grounding from point clouds by introducing D-LISA, a two-stage framework that combines dynamic box proposals, a scene-aware dynamic multi-view renderer for 2D features, and a language-informed spatial attention mechanism (LISA) to reason about spatial relations. The dynamic vision module selects a variable number of proposals via a learned probability ${\alpha}_m$, while the renderer learns scene-conditioned camera poses to improve auxiliary features; LISA then fuses visual and language cues with a distance-based spatial matrix to produce per-box grounding scores. Training optimizes a composite loss over detection, grounding, contrastive, and dynamic-proposal terms, and final predictions are obtained by thresholding $p_n$ for each proposal. Empirically, D-LISA achieves a $12.8\%$ absolute improvement over the prior state-of-the-art on Multi3DRefer, while also delivering strong single-object grounding results on ScanRefer and Nr3D, demonstrating effectiveness for both multi-object grounding and practical robotics applications with modest computational overhead.
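Two of the steps mentioned above, the distance-based spatial matrix and the final thresholding of per-proposal scores, can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `pairwise_center_distances` and `select_boxes`, the use of Euclidean distance between box centers, and the threshold value of 0.5 are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def pairwise_center_distances(centers: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return an N x N matrix of Euclidean distances between box centers.

    Assumption: distances are measured between box centers; the summary
    only says the spatial matrix is distance-based.
    """
    return [[math.dist(a, b) for b in centers] for a in centers]


def select_boxes(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return indices of proposals whose score p_n exceeds the threshold.

    Assumption: 0.5 is a placeholder threshold, not a value from the paper.
    Because each proposal is tested independently, the number of returned
    boxes can be zero, one, or many.
    """
    return [n for n, p in enumerate(scores) if p > threshold]


# Toy scene with three box proposals (hypothetical values).
centers = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)]
D = pairwise_center_distances(centers)

scores = [0.9, 0.2, 0.7]  # hypothetical per-proposal grounding scores p_n
selected = select_boxes(scores)

print(D[0][1], selected)  # 5.0 [0, 2]
```

Thresholding each proposal independently, rather than taking a single arg-max, is what lets a query match zero, one, or several objects, which is the defining requirement of the multi-object setting.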

Abstract

Multi-object 3D grounding involves locating 3D boxes in a point cloud based on a given query phrase. It is a challenging and significant task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach incorporating three innovations: first, a dynamic vision module that enables a variable and learnable number of box proposals; second, dynamic camera positioning that extracts features for each proposal; and third, a language-informed spatial attention module that better reasons over the proposals to output the final prediction. Experiments show that our method outperforms the state-of-the-art methods on multi-object 3D grounding by 12.8% (absolute) and is competitive in single-object 3D grounding.

Paper Structure

This paper contains 16 sections, 14 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Illustration of the overall pipeline. Our D-LISA processes the 3D point cloud through the dynamic visual module (Sec. \ref{sec:dynamic_vis}) and encodes the text description through a text encoder. The visual and word features are fused through a language-informed spatial fusion module (Sec. \ref{sec:spatial_fus}).
  • Figure 2: Illustration of language-informed spatial attention (LISA). We model object relations through the spatial distance $\bm{D}$. For each box proposal, a spatial score is predicted to balance the visual attention weights and the spatial relations.
  • Figure 3: Qualitative examples on the Multi3DRefer val set. For each scene-text pair, we visualize the predictions of M3DRef-CLIP, M3DRef-CLIP w/ NMS, and D-LISA, and the ground-truth labels, in magenta, blue, green, and red, respectively.
  • Figure 4: Qualitative results of the dynamic multi-view renderer. On the left, we show the learned pose distribution over the Multi3DRefer val set and visualize one example camera ray. On the right, we compare renderings with the fixed pose and with the dynamically learned pose.
  • Figure A1: Additional qualitative examples on the Multi3DRefer val set in the MT category. For each scene-text pair, we visualize the predictions of M3DRef-CLIP, M3DRef-CLIP w/ NMS, and D-LISA, and the ground-truth labels, in magenta, blue, green, and red, respectively.
  • ...and 1 more figure