Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding

Yang Liu, Daizong Liu, Wei Hu

TL;DR

This paper proposes a joint top-down and bottom-up framework for 3D visual grounding that aims to improve both accuracy and efficiency. Its bottom-up proposal generation module uses lightweight neural layers to regress and cluster several coarse object proposals instead of running a complex 3D detector.

Abstract

This paper tackles the challenging task of 3D visual grounding: locating a specific object in a 3D point cloud scene based on a text description. Existing methods fall into two categories: top-down and bottom-up. Top-down methods rely on a pre-trained 3D detector to generate candidate bounding boxes and then select the best one, which is time-consuming. Bottom-up methods directly regress object bounding boxes from coarse-grained features, which yields less accurate results. To combine their strengths while addressing their limitations, we propose a joint top-down and bottom-up framework that aims to improve both performance and efficiency. In the first stage, a bottom-up proposal generation module uses lightweight neural layers to efficiently regress and cluster several coarse object proposals instead of using a complex 3D detector. In the second stage, a top-down proposal consolidation module uses a graph design to aggregate and propagate query-related object contexts among the generated proposals for further refinement. By jointly training these two modules, we avoid the inherent drawbacks of both the costly proposal generation in the top-down framework and the coarse proposals in the bottom-up framework. Experimental results on the ScanRefer benchmark show that our framework achieves state-of-the-art performance.
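To make the second-stage idea of aggregating context among proposals over a graph more concrete, the following is a minimal sketch under stated assumptions. It is not the paper's consolidation module: the function name `aggregate_over_graph`, the adjacency-list format, and plain mean aggregation are all assumptions made for illustration, and a learned model would use trained weights and language-conditioned features instead.

```python
def aggregate_over_graph(features, neighbors, n_iters=1):
    """Illustrative mean aggregation (assumed, not the paper's update rule).

    features  : list of per-proposal feature vectors (lists of floats)
    neighbors : neighbors[i] is the list of proposal indices linked to i
    n_iters   : number of aggregation iterations
    """
    feats = [list(f) for f in features]
    for _ in range(n_iters):
        updated = []
        for i, own in enumerate(feats):
            # Each proposal averages its own feature with its neighbors'.
            group = [own] + [feats[j] for j in neighbors[i]]
            updated.append([sum(col) / len(group) for col in zip(*group)])
        feats = updated
    return feats


# Three hypothetical proposals connected in a chain: 0 - 1 - 2
toy_features = [[0.0], [2.0], [4.0]]
toy_neighbors = [[1], [0, 2], [1]]
print(aggregate_over_graph(toy_features, toy_neighbors, n_iters=1))
# prints [[1.0], [2.0], [3.0]]
```

Each extra iteration lets information travel one hop further along the graph, which is why the number of iterations is a natural hyperparameter to ablate.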

Paper Structure

This paper contains 15 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: (a): Typical procedure of top-down based method. (b): Typical procedure of bottom-up based method. (c): Procedure of our proposed method, where we generate initial proposals in an efficient bottom-up manner, and subsequently consolidate the proposals over graphs via an effective top-down approach.
  • Figure 2: The pipeline of our proposed method. Initially, we encode the input 3D point cloud and text with pre-trained encoders. In the bottom-up stage, our module fuses these features for language-guided object proposals. In the top-down stage, our refinement module enhances these proposals by graph-based features, followed by predicting matching scores to select the best-matching bounding box.
  • Figure 3: Ablation study on the number $K$ of points selected during the farthest point sampling step in the bottom-up based proposal generation module.
  • Figure 4: Ablation study on the number $n$ of graph-based information aggregation iterations.
  • Figure 5: Visualization of our method. The first column displays the ground truth bounding boxes provided by the ScanRefer dataset. The second and third columns represent the output results of our bottom-up based proposal generation module and the final output of the entire model, respectively. The last column shows the results obtained from the 3D-SPS method.
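The caption of Figure 3 refers to a farthest point sampling step that selects $K$ points. As a rough illustration of that general technique, here is a minimal pure-Python sketch. The function name `farthest_point_sampling`, the `start` parameter, and the toy coordinates are assumptions for this example, and the paper's actual implementation is not shown in this summary.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Return k indices into `points`, chosen greedily so that each new
    point is the one farthest from all points selected so far."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # Pick the point that is currently farthest from the selected set.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Four hypothetical 3D points; sample K = 3 of them.
toy_points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (0, 5, 0)]
print(farthest_point_sampling(toy_points, 3))
# prints [0, 2, 3]
```

In this sketch a larger $K$ keeps more seed points and so covers the scene more densely, at the cost of more computation downstream.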