PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding

Chenshu Hou, Liang Peng, Xiaopei Wu, Xiaofei He, Wenxiao Wang

TL;DR

PD-APE tackles 3D visual grounding by decoupling object attribute decoding from surrounding layout reasoning through a parallel dual-branch decoder. It introduces adaptive position encoding tailored to each branch: box-surface relative positioning guides target-object focus, while a text-guided gate biases attention toward points carrying meaningful layout information. This combination yields clearer, branch-specific attention maps and state-of-the-art results on ScanRefer and Nr3D without relying on additional 2D cues. The approach enhances grounding accuracy by explicitly modeling both object features and their spatial context in 3D scenes, with efficient computation via the box-surface positioning strategy.

Abstract

3D visual grounding aims to identify objects in 3D point cloud scenes that match specific natural language descriptions. This requires the model to not only focus on the target object itself but also to consider the surrounding environment to determine whether the descriptions are met. Most previous works attempt to accomplish both tasks within the same module, which can easily lead to a distraction of attention. To this end, we propose PD-APE, a dual-branch decoding framework that separately decodes target object attributes and surrounding layouts. Specifically, in the target object branch, the decoder processes text tokens that describe features of the target object (e.g., category and color), guiding the queries to pay attention to the target object itself. In the surrounding branch, the queries align with other text tokens that carry surrounding environment information, making the attention maps accurately capture the layout described in the text. Benefiting from the proposed dual-branch design, the queries are allowed to focus on points relevant to each branch's specific objective. Moreover, we design an adaptive position encoding method for each branch respectively. In the target object branch, the position encoding relies on the relative positions between seed points and predicted 3D boxes. In the surrounding branch, the attention map is additionally guided by the confidence between visual and text features, enabling the queries to focus on points that have valuable layout information. Extensive experiments demonstrate that we surpass the state-of-the-art on two widely adopted 3D visual grounding datasets, ScanRefer and Nr3D.
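The surrounding-branch idea in the abstract, an attention map that is additionally guided by a confidence between visual and text features, can be illustrated with a small sketch. This is not the authors' formulation: the abstract does not say how the confidence is computed or how it enters the attention. The sketch assumes a sigmoid of a scaled dot product as the confidence and adds its logarithm to the attention logits. All function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def text_guided_confidence(point_feats, text_feat):
    """ASSUMPTION: confidence = sigmoid(scaled dot product) between each
    point feature and one pooled text feature. The source only says the
    confidence is computed between visual and text features."""
    scale = math.sqrt(len(text_feat))
    return [1.0 / (1.0 + math.exp(-dot(f, text_feat) / scale))
            for f in point_feats]


def guided_attention(query, keys, confidence, eps=1e-6):
    """Attention weights of one query over all points.

    ASSUMPTION: the confidence acts as an additive log-bias on the logits,
    so after the softmax each weight is scaled by its point's confidence.
    """
    scale = math.sqrt(len(query))
    logits = [dot(query, k) / scale + math.log(c + eps)
              for k, c in zip(keys, confidence)]
    return softmax(logits)


# Two points with identical keys: plain attention would split 50/50,
# while the confidence bias shifts weight toward the first point.
weights = guided_attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [0.9, 0.1])
```

Adding the log-confidence to the logits is one common way to bias attention without changing the softmax itself. A multiplicative gate on the values would be another reasonable reading of the same sentence.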

Paper Structure

This paper contains 25 sections, 9 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Comparison between the serial decoder structure (a) and our parallel decoder structure (b). For grounding results shown in scenes, yellow represents the ground-truth boxes, red is the wrongly localized box, and green is the successfully localized one.
  • Figure 2: Architecture of our 3D visual grounding framework. Taking natural language and 3D point clouds as input, our model consists of three encoders for different modalities, a cross encoder, and a novel parallel decoder. Adaptive Position Encoding is added to the cross-attention modules of both the target object branch and the surrounding branch. The final visual and text output features are aligned with each other to generate the detected boxes.
  • Figure 3: Different methods for computing the relative position between a sample seed point A and the predicted box: (a) Vertex Relative Positioning [shen2024vdetr]; (b) Center Relative Positioning; (c)(d) Box-surface Relative Positioning for points outside/inside the box.
  • Figure 4: Illustration of the Adaptive Position Encoding method. Step 3, the text-guided confidence, is used only in the surrounding branch.
  • Figure 5: Visualization results of different models for scenes from the ScanRefer dataset. For all boxes, yellow represents the ground-truth references; red represents results from EDA that contain grounding errors; green represents proposals generated by our PD-APE. Words in different colors show the results of text decoupling.
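
The three positioning schemes named in the Figure 3 caption can be contrasted with a short sketch. It assumes an axis-aligned box given by a center and a size, and defines the box-surface offsets as signed distances to the six faces. The paper's exact definition (for example, any normalization or handling of rotated boxes) is not given in this summary and may differ. All function names are hypothetical.

```python
from itertools import product


def center_offsets(point, center):
    """Center relative positioning: 3 values, point minus box center."""
    return [p - c for p, c in zip(point, center)]


def vertex_offsets(point, center, size):
    """Vertex relative positioning: offsets to all 8 corners (8 x 3 values)."""
    offsets = []
    for halves in product((-0.5, 0.5), repeat=3):
        vertex = [c + s * h for c, s, h in zip(center, size, halves)]
        offsets.append([p - v for p, v in zip(point, vertex)])
    return offsets


def surface_offsets(point, center, size):
    """ASSUMED box-surface positioning: signed distance to the low and high
    face along each axis (6 values). A value is negative exactly when the
    point lies beyond that face, so the signs separate the outside and
    inside cases."""
    out = []
    for p, c, s in zip(point, center, size):
        low, high = c - s / 2.0, c + s / 2.0
        out.extend([p - low, high - p])
    return out


def is_inside(point, center, size):
    """A point is inside (or on) the box when no surface offset is negative."""
    return all(d >= 0 for d in surface_offsets(point, center, size))


box_center, box_size = (0.0, 0.0, 0.0), (2.0, 2.0, 2.0)
inside = surface_offsets((0.5, 0.0, 0.0), box_center, box_size)
outside = surface_offsets((2.0, 0.0, 0.0), box_center, box_size)
```

In this sketch the surface variant yields 6 numbers per point-box pair, against 24 for the vertex variant and 3 for the center variant. Unlike the center offsets alone, it also reflects the box extent.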