Unifying 3D Vision-Language Understanding via Promptable Queries

Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li

TL;DR

PQ3D addresses the fragmentation of 3D vision-language (3D-VL) understanding by unifying voxel, point-cloud, and multi-view image representations under a promptable-query framework. It introduces three innovations: segment-level unification of heterogeneous 3D scene representations, an attention-based query decoder guided by task prompts, and universal output heads that enable multi-task training. Across ten 3D-VL datasets, PQ3D sets new records on most benchmarks, covering tasks from instance segmentation to visual grounding, question answering, and dense captioning, with gains on ScanNet200 (+4.9% AP25), ScanRefer (+5.4% acc@0.5), Multi3DRefer (+11.7% F1@0.5), and Scan2Cap (+13.4% CIDEr@0.5). It can also accept a prompt type unseen during training, such as an image sketch, to localize the matching object, and, once instruction-tuned with a large language model and plugged into an embodied agent, can plan tasks and navigate to target objects. The work points to a single model for broad 3D-VL reasoning and planning as a practical path toward embodied agents that reason about and act in the 3D world.

Abstract

A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, because each scene representation has been applied independently and 3D multi-task training remains insufficiently explored. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks, setting new records on most benchmarks. In particular, PQ3D improves the state of the art on ScanNet200 by 4.9% (AP25), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with individual or combined forms of available 3D representations, e.g., solely voxel input.
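
To make the idea of segment-level grouping concrete, here is a minimal sketch in plain Python. It assumes that every point, voxel, or pixel feature has already been assigned a segment id, and that features are combined by average pooling; the function name `pool_by_segment`, the toy data, and the choice of averaging are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict

def pool_by_segment(features, segment_ids):
    """Average feature vectors that fall in the same segment.

    features: list of equal-length float lists (one per point, voxel, or pixel).
    segment_ids: list of ints giving the segment of each feature (assumed given).
    Returns a dict mapping segment id -> mean feature vector.
    """
    if len(features) != len(segment_ids):
        raise ValueError("features and segment_ids must have the same length")
    sums = {}
    counts = defaultdict(int)
    for feat, seg in zip(features, segment_ids):
        if seg not in sums:
            sums[seg] = [0.0] * len(feat)
        for i, value in enumerate(feat):
            sums[seg][i] += value
        counts[seg] += 1
    return {seg: [v / counts[seg] for v in total] for seg, total in sums.items()}

# Toy scene with two segments (0 and 1), seen through two representations.
point_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
point_segs = [0, 0, 1, 1]
voxel_feats = [[2.0, 2.0], [4.0, 0.0], [6.0, 2.0]]
voxel_segs = [0, 1, 1]

# Both representations end up indexed by the same segment ids.
point_seg_feats = pool_by_segment(point_feats, point_segs)
voxel_seg_feats = pool_by_segment(voxel_feats, voxel_segs)
print(point_seg_feats)
print(voxel_seg_feats)
```

After pooling, each representation yields one feature per segment, so features from different encoders can be compared or attended to on a common set of segments regardless of how many raw points or voxels each encoder produced.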

Paper Structure (23 sections, 9 equations, 10 figures, 15 tables)


Figures (10)

  • Figure 1: PQ3D is a unified model for 3D vision-language understanding, capable of taking various prompts (object categories, referring sentences, images, locations) to perform a wide range of tasks in a 3D scene, including instance segmentation, visual grounding, question answering and dense captioning. Remarkably, PQ3D can take a novel prompt type unseen during training, e.g., an image sketch of a vase, to locate the related object in the scene. If further instruction-tuned with a large language model and plugged into an embodied agent, PQ3D can also plan a complex task and navigate the agent to desired objects.
  • Figure 2: Comparison between PQ3D and other models. (a) When comparing PQ3D to other state-of-the-art (SOTA) methods, PQ3D demonstrates superior performance on most tasks. (b) Previous models have been designed for specific tasks and representations, often limiting the potential for developing a unified model. (c) Our PQ3D can flexibly accommodate various input representations, effectively addressing a wide range of tasks.
  • Figure 3: The model architecture of PQ3D, which consists of Task Prompt Encoding, 3D Scene Encoding, and Prompt-guided Query Learning modules. In prompt encoding, task prompts in diverse formats are projected to a shared feature space. In scene encoding, point clouds, voxel grids, and multi-view images of a scene are first encoded by corresponding encoders and then aligned into a shared 3D coordinate space. The prompt-guided query learning module takes in zero-initialized instance queries and progressively retrieves task-relevant information from aligned scene features under the guidance of task prompts. Finally, each updated instance query is fed into three output heads to predict an instance mask, a task-relevance score, and a sentence.
  • Figure 4: Ablation study of query decoder depth.
  • Figure 5: Qualitative examples from PQ3D. Red bounding boxes denote PQ3D predictions, and green bounding boxes denote ground truth.
  • ...and 5 more figures
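
The data flow described in the Figure 3 caption (zero-initialized instance queries that retrieve information from aligned scene features under prompt guidance, followed by output heads) can be sketched in plain Python as follows. This is a toy illustration under several assumptions that are not stated in the captions above: single-head dot-product attention with a residual update, prompt attention applied before scene attention, a hypothetical per-query positional vector `query_pos` to keep the queries distinct, and dot-product mask and relevance heads. The sentence head is omitted.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, query_pos, keys):
    """Each query adds an attention-weighted mean of `keys` (residual update)."""
    scale = math.sqrt(len(keys[0]))
    updated = []
    for q, p in zip(queries, query_pos):
        probe = [qi + pi for qi, pi in zip(q, p)]  # query content plus position
        weights = softmax([dot(probe, k) / scale for k in keys])
        context = [sum(w * k[i] for w, k in zip(weights, keys))
                   for i in range(len(q))]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated

def decode(prompt_tokens, scene_feature_sets, query_pos, num_layers=2):
    """Start from zero queries; attend to the prompt, then to each scene feature set."""
    dim = len(prompt_tokens[0])
    queries = [[0.0] * dim for _ in query_pos]  # zero-initialized instance queries
    for _ in range(num_layers):
        queries = cross_attend(queries, query_pos, prompt_tokens)
        for feats in scene_feature_sets:  # e.g. point, voxel, image segment features
            queries = cross_attend(queries, query_pos, feats)
    return queries

def output_heads(queries, segment_feats, prompt_tokens):
    """Toy heads: per-segment mask logits and a prompt-relevance score per query."""
    mask_logits = [[dot(q, s) for s in segment_feats] for q in queries]
    prompt_mean = [sum(col) / len(prompt_tokens) for col in zip(*prompt_tokens)]
    scores = [dot(q, prompt_mean) for q in queries]
    return mask_logits, scores

# Toy inputs: 2 prompt tokens, 2 segments per representation, 2 queries, dim 2.
prompt = [[1.0, 0.0], [0.0, 1.0]]
point_segments = [[2.0, 0.0], [0.0, 3.0]]
voxel_segments = [[2.0, 2.0], [5.0, 1.0]]
positions = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical positional vectors

queries = decode(prompt, [point_segments, voxel_segments], positions)
mask_logits, scores = output_heads(queries, point_segments, prompt)
print(mask_logits)
print(scores)
```

Because the scene feature sets are passed as a list, the same loop runs with a single representation (for example, only `voxel_segments`), which mirrors the flexible-inference property mentioned in the abstract.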