From Dataset to Real-world: General 3D Object Detection via Generalized Cross-domain Few-shot Learning

Shuangzhi Li, Junlong Shen, Lei Ma, Xingyu Li

TL;DR

This work addresses the generalization of LiDAR-based 3D object detection across domains under limited supervision. It introduces the generalized cross-domain few-shot (GCFS) task and a framework that couples image-guided multi-modal fusion with contrastive prototype learning. The fusion module injects open-set 2D semantic cues from vision-language models into 3D features, a physically-aware 2D-to-3D box search improves alignment between the two modalities, and contrastive learning yields discriminative, class-specific prototypes from few-shot data. Trained with a meta-learning scheme and combined losses, the approach reports state-of-the-art results on four GCFS benchmarks, with strong transfer on common classes and rapid adaptation to novel classes under tight data constraints. The practical benefit is deployment in new environments without large-scale target annotations: cross-modal priors and few-shot conditioning bridge both domain gaps and semantic shifts.

Abstract

LiDAR-based 3D object detection models often struggle to generalize to real-world environments due to limited object diversity in existing datasets. To address this, we introduce the first generalized cross-domain few-shot (GCFS) task in 3D object detection, aiming to adapt a source-pretrained model to both common and novel classes in a new domain with only few-shot annotations. We propose a unified framework that learns stable target semantics under limited supervision by bridging 2D open-set semantics with 3D spatial reasoning. Specifically, an image-guided multi-modal fusion module injects transferable 2D semantic cues into the 3D pipeline via vision-language models, while a physically-aware box search enhances 2D-to-3D alignment via LiDAR priors. To capture class-specific semantics from sparse data, we further introduce contrastive-enhanced prototype learning, which encodes few-shot instances into discriminative semantic anchors and stabilizes representation learning. Extensive experiments on GCFS benchmarks demonstrate the effectiveness and generality of our approach in realistic deployment settings.

Paper Structure

This paper contains 21 sections, 12 equations, 5 figures, 19 tables.

Figures (5)

  • Figure 1: GCFS in 3D object detection aims to adapt source-pretrained models for strong performance on common and novel classes in the target domain via limited target samples.
  • Figure 2: Proposed GCFS Framework. We first pretrain a detection model with source data. During finetuning on target few-shot samples, each query (an image and point cloud pair) is processed by GDino+SAM and the 3D backbone to obtain 2D instance-level masks and 3D features (top block). The 2D context contributes by 1) enriching the 3D features $\textbf{F}^{\text{fused}}$ with 2D semantic cues and 2) proposing high-quality "Box Candidates" via a novel 2D-to-3D box search. Proposal features $\textbf{F}^{\text{prp}}$ are refined by learnable prototypes $\textbf{F}^{\text{pro}}$ with an attention mechanism, and then passed to the final prediction (bottom block).
  • Figure 3: Physically-aware box search. Red boxes are GT boxes, and blue ones are searched boxes. For the "Cyclist" (left) and "Car" (right) examples, the angle and center biases of the searched boxes are corrected by $L_\text{BVC}$ and $L_\text{FVD}$.
  • Figure 4: Few-shot feature extraction and CL-enhanced prototype learning. In few-shot feature extraction, 2D and 3D ground-truth labels replace GDino and RPN outputs to extract object features.
  • Figure 5: Meta-learning scheme simulating few-shot learning with domain gaps.
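
The Figure 2 caption says that proposal features $\textbf{F}^{\text{prp}}$ are refined by learnable prototypes $\textbf{F}^{\text{pro}}$ with an attention mechanism. The caption does not specify the attention design, so the following is only a minimal sketch under stated assumptions: single-head scaled dot-product cross-attention with proposals as queries and prototypes as keys and values, no learned projections, and a residual connection. The function name `refine_with_prototypes` and all shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def refine_with_prototypes(f_prp, f_pro):
    """Refine proposal features with class prototypes (illustrative only).

    f_prp: (N, D) array of proposal features (queries).
    f_pro: (C, D) array of class prototypes (keys and values).
    Returns the refined (N, D) features and the (N, C) attention weights.

    Assumption: plain scaled dot-product attention plus a residual add;
    the paper's actual module may use learned projections or multiple heads.
    """
    d = f_prp.shape[1]
    scores = f_prp @ f_pro.T / np.sqrt(d)  # (N, C) proposal-prototype similarity
    attn = softmax(scores, axis=1)         # each proposal's weights over prototypes
    refined = f_prp + attn @ f_pro         # residual update with prototype mixture
    return refined, attn


# Toy usage with random data: 4 proposals, 3 class prototypes, 8-dim features.
rng = np.random.default_rng(0)
f_prp = rng.normal(size=(4, 8))
f_pro = rng.normal(size=(3, 8))
refined, attn = refine_with_prototypes(f_prp, f_pro)
print(refined.shape, attn.shape)
```

In this sketch each row of `attn` sums to one, so every proposal is shifted toward a convex combination of the prototypes; a proposal that strongly matches one prototype is pulled mostly toward that class anchor.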