From Dataset to Real-world: General 3D Object Detection via Generalized Cross-domain Few-shot Learning
Shuangzhi Li, Junlong Shen, Lei Ma, Xingyu Li
TL;DR
This work addresses the generalization of LiDAR-based 3D object detection across domains under limited supervision. It introduces the generalized cross-domain few-shot (GCFS) task and a framework that couples image-guided multi-modal fusion with contrastive prototype learning. The method injects open-set 2D semantic cues from vision-language models into 3D features using a physically-aware 2D-to-3D box search, and uses contrastive learning to build discriminative, class-specific prototypes from few-shot data. Optimized with a meta-learning scheme and combined losses, the approach achieves state-of-the-art results on four GCFS benchmarks, transferring well to common classes and adapting quickly to novel classes under tight data constraints. By using cross-modal priors and few-shot conditioning to bridge domain gaps and semantic shifts, it supports deployment in diverse environments without large-scale target annotations.
Abstract
LiDAR-based 3D object detection models often struggle to generalize to real-world environments because existing datasets cover a limited diversity of objects. To address this, we introduce the first generalized cross-domain few-shot (GCFS) task in 3D object detection, which aims to adapt a source-pretrained model to both common and novel classes in a new domain using only few-shot annotations. We propose a unified framework that learns stable target semantics under limited supervision by bridging 2D open-set semantics with 3D spatial reasoning. Specifically, image-guided multi-modal fusion injects transferable 2D semantic cues from vision-language models into the 3D pipeline, while a physically-aware box search improves 2D-to-3D alignment using LiDAR priors. To capture class-specific semantics from sparse data, we further introduce contrastive-enhanced prototype learning, which encodes few-shot instances into discriminative semantic anchors and stabilizes representation learning. Extensive experiments on GCFS benchmarks demonstrate the effectiveness and generality of our approach in realistic deployment settings.
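To make the idea of contrastive prototype learning concrete, the following is a minimal, generic sketch and not the authors' implementation. It assumes each few-shot instance has already been encoded as a feature vector, builds one prototype per class as the normalized mean of its normalized features, and scores features against prototypes with a temperature-scaled softmax (an InfoNCE-style loss). The function names, the temperature value, and the toy data are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def class_prototypes(features, labels, num_classes):
    """One prototype per class: normalized mean of normalized features.

    Classes without any support instance get an all-zero prototype.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feat, label in zip(features, labels):
        unit = l2_normalize(feat)
        counts[label] += 1
        for i in range(dim):
            sums[label][i] += unit[i]
    protos = []
    for c in range(num_classes):
        mean = [s / max(counts[c], 1) for s in sums[c]]
        protos.append(l2_normalize(mean))
    return protos


def prototype_contrastive_loss(features, labels, protos, temperature=0.1):
    """Average cross-entropy over cosine similarities to all prototypes.

    Minimizing it pulls each feature toward its own class prototype and
    pushes it away from the prototypes of the other classes.
    """
    total = 0.0
    for feat, label in zip(features, labels):
        unit = l2_normalize(feat)
        logits = [sum(a * b for a, b in zip(unit, p)) / temperature
                  for p in protos]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[label]
    return total / len(features)


# Toy support set (hypothetical): two classes, two instances each.
support_feats = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1],
                 [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
support_labels = [0, 0, 1, 1]

protos = class_prototypes(support_feats, support_labels, num_classes=2)
loss_ok = prototype_contrastive_loss(support_feats, support_labels, protos)
loss_swapped = prototype_contrastive_loss(support_feats, [1, 1, 0, 0], protos)
print(loss_ok < loss_swapped)
```

In this toy setting the loss is small when labels match the cluster structure and large when they are swapped, which is the property that lets prototypes act as semantic anchors. A real system would compute the same quantities on learned detector features inside a training loop.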
