Multi-Object Grounding via Hierarchical Contrastive Siamese Transformers

Chengyi Du, Keyan Jin

TL;DR

The paper tackles multi-object grounding in 3D scenes by introducing Hierarchical Contrastive Siamese Transformers (H-COST). It combines a hierarchical refinement strategy with a contrastive Siamese framework of two structurally identical networks: an auxiliary network driven by ground-truth semantics and an inference network operating on segmented point clouds. Together they progressively localize multiple objects while aligning intermediate representations. Key contributions include a hierarchical loss with distance thresholds $\delta_s$ and a Siamese contrastive objective that jointly optimizes two alignment losses, $L_{align}^A$ and $L_{align}^H$, and a distinctiveness loss $L_{distinct}$, yielding $L_{siam\_contra} = \alpha L_{distinct} + L_{align}^A + L_{align}^H$. On Multi3DRefer, H-COST achieves a 9.5% improvement over the prior best methods, with strong gains in complex multi-object scenarios and robust single-object performance, which suggests potential impact on robotics and AR/VR scene understanding. The approach uses spatial-aware attention and cross-attention within transformer fusion blocks to fuse language and 3D geometry, enabling precise grounding in cluttered 3D environments.
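
As a minimal sketch of how the combined objective stated above is assembled, the snippet below adds the three terms with the weight $\alpha$ applied only to the distinctiveness term. The function name, argument names, and the numeric values are illustrative assumptions; this summary does not say how each individual term is computed or which value of $\alpha$ is used.

```python
def siamese_contrastive_loss(l_distinct, l_align_a, l_align_h, alpha=1.0):
    """Combine the three terms as
    L_siam_contra = alpha * L_distinct + L_align^A + L_align^H.

    The inputs are already-computed scalar loss values. The default
    alpha=1.0 is a placeholder, not a value taken from the paper.
    """
    return alpha * l_distinct + l_align_a + l_align_h


# Hypothetical per-batch loss values, chosen only to show the arithmetic.
total = siamese_contrastive_loss(l_distinct=0.5, l_align_a=0.2, l_align_h=0.3, alpha=2.0)
print(total)
```

Because $\alpha$ scales only $L_{distinct}$, raising it shifts the balance toward separating different objects relative to aligning the two networks.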

Abstract

Multi-object grounding in 3D scenes involves localizing multiple objects based on natural language input. While previous work has primarily focused on single-object grounding, real-world scenarios often demand the localization of several objects. To tackle this challenge, we propose Hierarchical Contrastive Siamese Transformers (H-COST), which employs a Hierarchical Processing strategy to progressively refine object localization, enhancing the understanding of complex language instructions. Additionally, we introduce a Contrastive Siamese Transformer framework, where two networks with an identical structure are used: one auxiliary network processes robust object relations from ground-truth labels to guide and enhance the second network, the reference network, which operates on segmented point-cloud data. This contrastive mechanism strengthens the model's semantic understanding and significantly enhances its ability to process complex point-cloud data. Our approach outperforms previous state-of-the-art methods by 9.5% on challenging multi-object grounding benchmarks.

Paper Structure

This paper contains 22 sections, 16 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Hierarchical Contrastive Siamese Transformers (H-COST) starts with broad localization (bottom) to identify general regions and refines to precise localization. Green arrows show inter-object comparisons for differentiating objects, while blue arrows indicate intra-object alignment for consistency. Red bounding boxes highlight objects refined through hierarchical processing.
  • Figure 2: Overall architecture of H-COST. This figure shows the integration of the auxiliary and inference networks. The auxiliary network processes ground-truth semantic features from text inputs, while the inference network extracts object features from raw point-cloud data. Both networks pass their respective features through transformer-based fusion blocks, which progressively refine the predictions using initial, progressive, and final refinement heads. The final grounded prediction is obtained through a grounding head that operates on the output of the final hidden state.
  • Figure 3: Qualitative results of M3DRef-CLIP versus H-COST on Multi3DRefer using predicted boxes. Brown boxes indicate object proposals, yellow boxes represent initially predicted objects, green boxes are refined predictions, red boxes are true positives with IoU threshold $\tau_{\text{pred}} > 0.5$, and black boxes denote missed objects.
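
The true-positive criterion in the Figure 3 caption relies on intersection over union (IoU) between a predicted and a ground-truth box. The sketch below shows one way such a check could look, assuming axis-aligned 3D boxes stored as `(xmin, ymin, zmin, xmax, ymax, zmax)`. The box format, the function names, and the strict `>` comparison are assumptions made for illustration, not details confirmed by the paper.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as
    (xmin, ymin, zmin, xmax, ymax, zmax). The box layout is an assumption."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])          # start of the overlap on this axis
        hi = min(a[i + 3], b[i + 3])  # end of the overlap on this axis
        if hi <= lo:
            return 0.0                # no overlap on this axis, so no intersection
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Count a prediction as a true positive when its IoU exceeds the threshold."""
    return box_iou_3d(pred_box, gt_box) > threshold


# Two boxes of volume 8 that overlap in a volume of 4: IoU = 4 / (8 + 8 - 4) = 1/3.
pred = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
gt = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
print(box_iou_3d(pred, gt), is_true_positive(pred, gt))
```

In this example the prediction would not be drawn as a true positive, since an IoU of one third falls below 0.5.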