Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans

Taiki Miyanishi, Daichi Azuma, Shuhei Kurita, Motoki Kawanabe

TL;DR

Cross3DVG tackles cross-dataset generalization in 3D visual grounding. It introduces RIORefer, a dataset built from 3RScan, and evaluates models trained on one of ScanRefer and RIORefer against the other without using target labels. The authors propose a CLIP-based baseline that fuses multi-view 2D image features with 3D object proposals to bridge dataset gaps, and they analyze how detectors, localization modules, and multi-view image features affect performance. Results show that cross-dataset grounding is substantially harder than in-dataset grounding, but stronger detectors and CLIP-guided fusion reduce the gap. The work provides a benchmark and practical guidance for building 3D grounding systems that generalize across RGB-D sensing setups, with relevance to robotics and AR applications.

Abstract

We present a novel task for cross-dataset visual grounding in 3D scenes (Cross3DVG), which addresses limitations of existing 3D visual grounding models, specifically their restricted 3D resources and consequent tendency to overfit to a specific 3D dataset. To facilitate Cross3DVG, we created RIORefer, a large-scale 3D visual grounding dataset. It includes more than 63k diverse human-annotated descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan. After training a Cross3DVG model on a source 3D visual grounding dataset, we evaluate it without target labels on a target dataset that differs in, e.g., sensors, 3D reconstruction methods, and language annotators. Comprehensive experiments are conducted using established visual grounding models and a CLIP-based integration of multi-view 2D and 3D features designed to bridge gaps among 3D datasets. The experiments show that (i) cross-dataset 3D visual grounding performs significantly worse than training and evaluation on a single dataset because of the variation in 3D data and language across datasets, and that (ii) better object detectors and localization modules, together with fusing 3D data and multi-view CLIP-based image features, can alleviate this drop in performance. Our Cross3DVG task can provide a benchmark for developing robust 3D visual grounding models that handle diverse 3D scenes while leveraging deep language understanding.

Paper Structure

This paper contains 21 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Example of cross-dataset 3D visual grounding with our Cross3DVG dataset consisting of ScanRefer and RIORefer. We evaluate the cross-dataset 3D visual grounding model trained on one dataset using the other dataset.
  • Figure 2: Mutual differences of ScanRefer and RIORefer datasets.
  • Figure 3: Class category distributions for ScanRefer and RIORefer, based on the numbers of surface point annotations per category. Categories are ordered by the number of points per category on ScanRefer.
  • Figure 4: Overview of the proposed method. Our method takes as input a 3D point cloud and RGB images of a scene. For each object proposal predicted by the object detector module, the method retrieves the surrounding object images using the camera pose matrix and the bounding box coordinates. The method assigns weights to image features extracted from the surrounding object images by the CLIP image encoder (CLIP$_\mathrm{I}$) using text features encoded from the description by the CLIP text encoder (CLIP$_\mathrm{T}$). Finally, the object localization module computes the object confidence scores using 2D and 3D object features, subsequently predicting the bounding box of the target object.
  • Figure 5: Qualitative results from 3DVG-Transformer and our method (Ours). The ground truth (GT) boxes are shown in green, whereas the predicted boxes with an IoU score higher than 0.5 are shown in blue. The text at the top of each image presents the given description for visual grounding. The additional images below our method's results (Ours) are the images retrieved based on the coordinates of the object proposals.
  • ...and 5 more figures
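
The text-guided weighting step described in the Figure 4 caption can be illustrated with a small sketch. It is not the authors' implementation. It assumes that each retrieved view of an object proposal already has an image feature vector and that the description has a text feature vector in the same space, as CLIP encoders would provide. The specific choices here are assumptions made for illustration: cosine similarity as the relevance score, a softmax with a hypothetical `temperature` parameter to turn scores into weights, and plain concatenation in the hypothetical helper `fuse_2d_3d` to combine the pooled 2D feature with a 3D proposal feature.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def text_weighted_view_feature(view_feats, text_feat, temperature=0.1):
    """Weight per-view image features by their similarity to the text feature.

    Assumption: softmax over cosine similarity; the paper's caption only says
    that image features are weighted using the text features.
    """
    sims = [cosine(f, text_feat) for f in view_feats]
    weights = softmax(sims, temperature)
    dim = len(text_feat)
    pooled = [sum(w * f[i] for w, f in zip(weights, view_feats)) for i in range(dim)]
    return weights, pooled


def fuse_2d_3d(feat_3d, feat_2d):
    """Hypothetical fusion by concatenation of a 3D proposal feature and the pooled 2D feature."""
    return list(feat_3d) + list(feat_2d)


if __name__ == "__main__":
    # Toy stand-ins for CLIP features of three views of one object proposal.
    views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    text = [1.0, 0.0]  # toy stand-in for the description's text feature
    weights, pooled = text_weighted_view_feature(views, text)
    fused = fuse_2d_3d([0.5, 0.5, 0.5], pooled)
    print(weights, pooled, fused)
```

In this toy setup, views whose features align more closely with the description receive larger weights, so the pooled 2D feature is dominated by the views most relevant to the text before it is passed, together with the 3D feature, to a localization module.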