Learning Point-Language Hierarchical Alignment for 3D Visual Grounding

Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang

TL;DR

This work tackles 3D visual grounding by proposing HAM, an end-to-end framework that learns hierarchical, multi-granularity visual and linguistic representations. Central to HAM is the PLACM mechanism, which fuses proposal-point visual features with word- and sentence-level language embeddings through context-modulated attention, while spatially multi-granular modeling (SMGM) applies PLACM to both global and local spatial fields. The model is enhanced by three prompt-engineering strategies and a concentration sampling scheme, and it is adaptable to both grounding-by-detection and identification paradigms. Empirically, HAM achieves state-of-the-art results on ScanRefer and competitive performance on ReferIt3D, winning the ECCV 2022 ScanRefer Challenge and demonstrating robust, interpretable grounding under diverse linguistic descriptions.

Abstract

This paper presents a novel hierarchical alignment model (HAM) that learns multi-granularity visual and linguistic representations in an end-to-end manner. We extract key points and proposal points to model 3D contexts and instances, and propose a point-language alignment with context modulation (PLACM) mechanism, which learns to gradually align word-level and sentence-level linguistic embeddings with visual representations, while modulation by the visual context captures latent informative relationships. To further capture both global and local relationships, we propose a spatially multi-granular modeling scheme that applies PLACM to both global and local fields. Experimental results demonstrate the superiority of HAM, with visualized results showing that it can dynamically model fine-grained visual and linguistic representations. HAM outperforms existing methods by a significant margin, achieves state-of-the-art performance on two publicly available datasets, and won the championship in the ECCV 2022 ScanRefer challenge. Code is available at https://github.com/PPjmchen/HAM.
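
To make the idea of aligning proposal features with word-level and sentence-level embeddings more concrete, here is a minimal sketch using plain scaled dot-product cross-attention. It is not the authors' PLACM: the context modulation step is omitted, and the function names (`cross_attention`, `match_scores`), the residual fusion, and the toy 2-D features are all assumptions made for illustration only.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(proposals, words):
    # Hypothetical word-level step: each proposal feature attends over
    # the word embeddings (scaled dot-product), then the attended
    # language context is added back to the proposal (residual fusion).
    dim = len(words[0])
    fused, weights = [], []
    for p in proposals:
        scores = [sum(a * b for a, b in zip(p, w)) / math.sqrt(dim)
                  for w in words]
        attn = softmax(scores)
        ctx = [sum(a * w[i] for a, w in zip(attn, words))
               for i in range(dim)]
        fused.append([pi + ci for pi, ci in zip(p, ctx)])
        weights.append(attn)
    return fused, weights

def match_scores(fused, sentence):
    # Hypothetical sentence-level step: score each fused proposal
    # against the sentence embedding and normalize over proposals.
    logits = [sum(a * b for a, b in zip(f, sentence)) for f in fused]
    return softmax(logits)

# Toy inputs (assumed): three proposals, two words, one sentence vector.
proposals = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
words = [[1.0, 0.0], [0.0, 1.0]]
sentence = [1.0, 0.0]

fused, weights = cross_attention(proposals, words)
scores = match_scores(fused, sentence)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy setup the proposal whose feature points in the same direction as the sentence embedding receives the highest matching score. A real model would use learned projections and high-dimensional features in place of these hand-written vectors.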

Paper Structure (24 sections, 8 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Demonstration of the proposed HAM framework on the ScanRefer benchmark. This example demonstrates HAM's ability to comprehend spatial relationships through free-form language and accurately localize targets in irregular point clouds.
  • Figure 2: Visualization of 2D and 3D visual grounding examples. While 2D visual grounding is typically performed on images, 3D visual grounding is more challenging, requiring a deeper understanding of the intricate spatial relationships, as well as the accompanying lengthy and complex language.
  • Figure 3: Workflow of the proposed hierarchical alignment model (HAM). The initial pre-processing and encoding of raw point clouds and query texts yield key points, proposal points, associated point features, proposal bounding boxes, as well as word-level and sentence-level embeddings. Two branches are developed for spatially multi-granular modeling (SMGM): one implementing point-language alignment with context modulation (PLACM) on the global field, and the other applying PLACM on local regions generated by space partitioning. The outputs of both branches are merged and passed through MLP layers to predict the final matching scores.
  • Figure 4: Three cumulative prompt engineering strategies: (1) word masking, (2) intra-sentence ensemble, and (3) inter-sentence ensemble.
  • Figure 5: Flowchart of the mechanism of point-language alignment with context modulation (PLACM).
  • ...and 6 more figures