GLRD: Global-Local Collaborative Reason and Debate with PSL for 3D Open-Vocabulary Detection

Xingyu Peng, Si Liu, Chen Gao, Yan Bai, Beipeng Mu, Xiaofei Wang, Huaxia Xia

TL;DR

GLRD tackles LiDAR-based 3D Open-Vocabulary Detection by combining local object features with global scene understanding through LLMs, and by making decisions more robust with a Probabilistic Soft Logic solver and a debate mechanism. Key contributions include Reflected Pseudo Labels Generation (RPLG) and Background-Aware Object Localization (BAOL) to improve supervision and proposals, Static and Dynamic Balance between Classes (SBC/DBC) to address class imbalance, and OV-PSL-based global-local collaboration that decides whether to keep, remove, or reclassify each detection, with a scene-driven debate for confusable objects. The framework shows strong gains on ScanNet and SUN RGB-D in both the partial and full open-vocabulary settings, achieving state-of-the-art results on metrics such as $AP_{25}^{novel}$ and $AP_{25}^{mean}$, and substantial improvements in the top-$10$ and top-$20$ class settings of the full open-vocabulary setting. Overall, GLRD offers a principled way to integrate scene context, common sense reasoning, and probabilistic optimization for 3D open-vocabulary detection in noisy point-cloud data.

Abstract

The task of LiDAR-based 3D Open-Vocabulary Detection (3D OVD) requires the detector to learn to detect novel objects from point clouds without off-the-shelf training labels. Previous methods focus on learning object-level representations and ignore scene-level information, which makes it hard to distinguish objects of similar classes. In this work, we propose a Global-Local Collaborative Reason and Debate with PSL (GLRD) framework for the 3D OVD task, considering both local object-level information and global scene-level information. Specifically, an LLM is used to perform common sense reasoning based on object-level and scene-level information, and the detection result is refined accordingly. To further boost the LLM's ability to make precise decisions, we also design a probabilistic soft logic solver (OV-PSL) to search for the optimal solution, and a debate scheme to confirm the class of confusable objects. In addition, to alleviate the uneven distribution of classes, a static balance scheme (SBC) and a dynamic balance scheme (DBC) are designed. Furthermore, to reduce the influence of noise in data and training, we propose Reflected Pseudo Labels Generation (RPLG) and Background-Aware Object Localization (BAOL). Extensive experiments conducted on ScanNet and SUN RGB-D demonstrate the superiority of GLRD: in the partial open-vocabulary setting, the absolute improvements in mean average precision are $+2.82\%$ on SUN RGB-D and $+3.72\%$ on ScanNet. In the full open-vocabulary setting, the absolute improvements in mean average precision are $+4.03\%$ on ScanNet and $+14.11\%$ on SUN RGB-D.

Paper Structure

This paper contains 22 sections, 17 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: (a) In LiDAR-based 3D OVD, the object class may be wrongly recognized when only object-level/local information is considered, e.g., mistaking a cabinet for a desk. (b)(c) In contrast, GLRD considers both the scene-level/global information and the object-level/local information. Specifically, an LLM is used to conduct scene understanding and common sense reasoning. In addition, to boost the LLM's ability to make precise decisions, a probabilistic soft logic solver and a debate scheme are devised.
  • Figure 2: Overview of GLRD. The GLRD framework enhances 3D open-vocabulary object detection in three aspects: data (yellow block), training (green block), and inference (red block). The yellow block presents the generation of pseudo labels, the green block shows the training pipeline, and the red block demonstrates the inference workflow. (i) In the data aspect, Reflected Pseudo Labels Generation (RPLG) and Static Balance between Classes (SBC) form a loop that generates precise 3D pseudo labels. (ii) In the training aspect, Background-Aware Object Localization (BAOL) is proposed to distinguish foreground objects from the background and remove low-quality proposals. In addition, Dynamic Balance between Classes (DBC) balances model attention across different classes. (iii) In the inference aspect, an LLM is used to conduct Global-Local Collaboration and refine the initial detection result. A probabilistic soft logic solver (OV-PSL) is designed to score each detected object based on common sense constraints.
  • Figure 3: The Reflected Pseudo Labels Generation (RPLG) module. The image patch and class of each original 2D pseudo label are sent into CLIP with two templates. CLIP judges the image patch's consistency with the class by computing its similarity with two text templates. The labels whose $\phi^+$ is below the threshold $\phi_{CLIP}$ are deleted, forming more accurate pseudo labels.
  • Figure 4: The balance mechanism is composed of Static Balance between Classes (SBC) and Dynamic Balance between Classes (DBC). (a) SBC balances the number of pseudo labels of different classes by adjusting the confidence threshold automatically. (b) DBC balances the learning efficiency of different classes by adjusting the loss weight automatically.
  • Figure 5: The pipeline of Global-Local Collaboration. (i) Common sense is retrieved from the LLM to form the constraints $x_{size}, x_{scene}, x_{conf}$. (ii) OV-PSL is used to work out the optimal operation for the object, chosen from "keep/remove/reclassify". (iii) If the operation is "reclassify", a debate is conducted to determine the class of the object.
  • ...and 3 more figures
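
The filtering step described in the Figure 3 caption can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes $\phi^+$ is the softmax probability of the positive template over the two CLIP similarities, it takes those similarities as precomputed numbers rather than calling CLIP, and the `PseudoLabel` type, the `temperature` parameter, and the default threshold of 0.5 for $\phi_{CLIP}$ are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PseudoLabel:
    """A 2D pseudo label with precomputed CLIP similarities (assumed inputs)."""
    class_name: str
    sim_pos: float  # similarity of the patch with the positive text template
    sim_neg: float  # similarity of the patch with the negative text template


def phi_plus(sim_pos: float, sim_neg: float, temperature: float = 1.0) -> float:
    """Softmax probability of the positive template (assumed definition)."""
    top = max(sim_pos, sim_neg)  # subtract the max for numerical stability
    e_pos = math.exp((sim_pos - top) / temperature)
    e_neg = math.exp((sim_neg - top) / temperature)
    return e_pos / (e_pos + e_neg)


def filter_pseudo_labels(labels: list[PseudoLabel],
                         phi_clip: float = 0.5) -> list[PseudoLabel]:
    """Delete labels whose phi_plus falls below the threshold phi_clip."""
    return [
        label for label in labels
        if phi_plus(label.sim_pos, label.sim_neg) >= phi_clip
    ]


labels = [
    PseudoLabel("chair", sim_pos=0.30, sim_neg=0.20),
    PseudoLabel("desk", sim_pos=0.18, sim_neg=0.27),
]
kept = filter_pseudo_labels(labels)
```

With the assumed threshold, only the label whose patch is closer to the positive template than to the negative one survives; raising `phi_clip` makes the filter stricter.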