Multi-branch Collaborative Learning Network for 3D Visual Grounding

Zhipeng Qian, Yiwei Ma, Zhekai Lin, Jiayi Ji, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji

TL;DR

The paper tackles 3D visual grounding through two closely related tasks, 3DREC and 3DRES, with a two-branch framework (MCLN) that preserves task-specific learning while enabling cross-task collaboration. It introduces Relative Superpoint Aggregation (RSA) to generate coherent superpoint features and Adaptive Soft Alignment (ASA) to align and mutually reinforce the predictions of the two branches, including adaptive, quality-aware losses. Results on ScanRefer (and SR3D/NR3D) show state-of-the-art performance on both tasks, with a gain of 2.05% in Acc@0.5 for 3DREC and 3.96% in mIoU for 3DRES, and ablations support the contributions of RSA and ASA. The results suggest that independent task branches paired with targeted alignment mechanisms yield robust cross-task grounding in complex 3D scenes, with potential for broader multi-task grounding applications.

Abstract

3D referring expression comprehension (3DREC) and segmentation (3DRES) have overlapping objectives, indicating their potential for collaboration. However, existing collaborative approaches predominantly depend on the results of one task to make predictions for the other, limiting effective collaboration. We argue that employing separate branches for 3DREC and 3DRES tasks enhances the model's capacity to learn specific information for each task, enabling them to acquire complementary knowledge. Thus, we propose the MCLN framework, which includes independent branches for 3DREC and 3DRES tasks. This enables dedicated exploration of each task and effective coordination between the branches. Furthermore, to facilitate mutual reinforcement between these branches, we introduce a Relative Superpoint Aggregation (RSA) module and an Adaptive Soft Alignment (ASA) module. These modules significantly contribute to the precise alignment of prediction results from the two branches, directing the module to allocate increased attention to key positions. Comprehensive experimental evaluation demonstrates that our proposed method achieves state-of-the-art performance on both the 3DREC and 3DRES tasks, with an increase of 2.05% in Acc@0.5 for 3DREC and 3.96% in mIoU for 3DRES.
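The two metrics quoted above can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, per-point binary masks, and a threshold test of `IoU >= 0.5`. The function names are hypothetical and this is not the authors' evaluation code; it returns fractions in [0, 1], whereas the abstract reports percentages.

```python
from typing import Sequence

Box = Sequence[float]  # assumed layout: (xmin, ymin, zmin, xmax, ymax, zmax)


def box_iou_3d(a: Box, b: Box) -> float:
    """IoU of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        overlap = min(a[axis + 3], b[axis + 3]) - max(a[axis], b[axis])
        if overlap <= 0:
            return 0.0  # boxes do not intersect along this axis
        inter *= overlap
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], thresh: float = 0.5) -> float:
    """Fraction of predicted boxes whose IoU with the ground truth reaches thresh."""
    hits = [box_iou_3d(p, g) >= thresh for p, g in zip(preds, gts)]
    return sum(hits) / len(hits)


def mask_iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """IoU of two per-point binary masks of equal length."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 0.0


def mean_iou(preds: Sequence[Sequence[int]], gts: Sequence[Sequence[int]]) -> float:
    """Mean of per-sample mask IoUs (mIoU)."""
    ious = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)
```

For example, a unit cube compared with a copy shifted by 0.5 along x has an IoU of 1/3, so it would not count as correct at the 0.5 threshold.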

Paper Structure (34 sections, 15 equations, 5 figures, 10 tables)

This paper contains 34 sections, 15 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: TGNN predicts the results of 3DREC and 3DRES tasks in a cascade manner. Our method introduces separate branches for 3DREC and 3DRES tasks and uses a parallel structure to facilitate collaborative training of them.
  • Figure 2: An overview of our proposed network, which comprises a cross-modal encoder, separate decoders for 3DREC and 3DRES, a “Relative Superpoint Aggregation” (RSA) module for superpoint feature generation, and an “Adaptive Soft Alignment” (ASA) module for joint training.
  • Figure 3: Qualitative results from 3D-STMN [2308.16632], EDA [wu2022eda], and our model.
  • Figure 4: Ablation study on the 3DIE metric. Our ASA module effectively decreases the 3DIE value, demonstrating its effectiveness in alignment.
  • Figure 5: Failure cases: (a) and (b) show inconsistency between the predicted masks and bounding boxes; (c) and (d) show cases where the model generates discrete masks alongside inaccurate predicted bounding boxes.