GLRD: Global-Local Collaborative Reason and Debate with PSL for 3D Open-Vocabulary Detection
Xingyu Peng, Si Liu, Chen Gao, Yan Bai, Beipeng Mu, Xiaofei Wang, Huaxia Xia
TL;DR
GLRD tackles LiDAR-based 3D Open-Vocabulary Detection by combining local object-level information with global scene-level understanding through an LLM, and by making the resulting decisions more reliable with a Probabilistic Soft Logic solver (OV-PSL) and a debate mechanism. Key contributions include Reflected Pseudo Labels Generation (RPLG) and Background-Aware Object Localization (BAOL), which improve supervision and proposals; Static and Dynamic Balance between Classes (SBC/DBC), which address class imbalance; and OV-PSL-based global-local collaboration for keep/remove/reclassify decisions, with a scene-driven debate for confusable objects. The framework shows strong gains on ScanNet and SUN RGB-D in both the Partial and Full Open-Vocabulary settings, achieving state-of-the-art results on several metrics such as $AP_{25}^{novel}$ and $AP_{25}^{mean}$, with substantial improvements in the top-$10$ and top-$20$ class settings of the Full Open-Vocabulary setting. Overall, GLRD offers a principled way to integrate scene context, common sense reasoning, and probabilistic optimization for 3D open-vocabulary detection on noisy point-cloud data.
Abstract
LiDAR-based 3D Open-Vocabulary Detection (3D OVD) requires a detector to learn to detect novel objects from point clouds without off-the-shelf training labels. Previous methods focus on learning object-level representations and ignore scene-level information, which makes it hard to distinguish objects of similar classes. In this work, we propose Global-Local Collaborative Reason and Debate with PSL (GLRD), a framework for the 3D OVD task that considers both local object-level information and global scene-level information. Specifically, an LLM performs common sense reasoning over the object-level and scene-level information, and the detection result is refined accordingly. To help the LLM make precise decisions, we also design a probabilistic soft logic solver (OV-PSL) that searches for the optimal solution, and a debate scheme that confirms the class of confusable objects. To alleviate the uneven distribution of classes, we design a static balance scheme (SBC) and a dynamic balance scheme (DBC). To reduce the influence of noise in data and training, we further propose Reflected Pseudo Labels Generation (RPLG) and Background-Aware Object Localization (BAOL). Extensive experiments on ScanNet and SUN RGB-D demonstrate the superiority of GLRD: in the partial open-vocabulary setting, the absolute improvements in mean average precision are $+2.82\%$ on SUN RGB-D and $+3.72\%$ on ScanNet; in the full open-vocabulary setting, they are $+4.03\%$ on ScanNet and $+14.11\%$ on SUN RGB-D.
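To make the idea of a soft-logic solver concrete, the following is a minimal sketch of standard Probabilistic Soft Logic machinery, not the OV-PSL solver itself. It uses the Lukasiewicz relaxation, in which truth values lie in $[0, 1]$ and each rule `body -> head` incurs a hinge penalty $\max(0, I(\text{body}) - I(\text{head}))$. The two rules, their weights, the input scores, and all function names are hypothetical and chosen only for illustration; the grid search stands in for the convex optimization a real PSL solver would use.

```python
def luk_and(*vals):
    """Lukasiewicz conjunction of soft truth values in [0, 1]."""
    return max(0.0, sum(vals) - (len(vals) - 1))


def distance_to_satisfaction(body, head):
    """Hinge penalty of the rule body -> head: max(0, I(body) - I(head))."""
    return max(0.0, luk_and(*body) - head)


def total_loss(keep, det_conf, scene_incompat, w_local, w_global):
    """Weighted penalty of two hypothetical rules for one detected object.

    Local rule  (assumed): DetConf(o)       -> Keep(o)
    Global rule (assumed): SceneIncompat(o) -> not Keep(o)
    """
    local = distance_to_satisfaction([det_conf], keep)
    global_ = distance_to_satisfaction([scene_incompat], 1.0 - keep)
    return w_local * local + w_global * global_


def solve_keep(det_conf, scene_incompat, w_local=1.0, w_global=1.0, steps=100):
    """Grid-search the soft truth value of Keep(o) that minimizes the penalty."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda k: total_loss(k, det_conf, scene_incompat, w_local, w_global),
    )


if __name__ == "__main__":
    # A confident detection that the scene context strongly contradicts.
    print(solve_keep(0.9, 0.8, w_local=1.0, w_global=2.0))
    print(solve_keep(0.9, 0.8, w_local=2.0, w_global=1.0))
```

In this toy setting, whichever rule carries the larger weight dominates: a heavier global rule pushes the keep score down toward `1 - scene_incompat`, while a heavier local rule pulls it up toward the detector confidence. This mirrors the general trade-off between object-level and scene-level evidence described above, under the stated assumptions.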
