LAB-Det: Language as a Domain-Invariant Bridge for Training-Free One-Shot Domain Generalization in Object Detection
Xu Zhang, Zhe Chen, Jing Zhang, Dacheng Tao
TL;DR
LAB-Det introduces training-free one-shot domain generalization for object detection by translating one exemplar per class into descriptive language and conditioning a frozen detector via language prompts. By combining exemplar-derived phrases with a language-conditioned detection pipeline and a light BLIP-based calibration, it achieves strong cross-domain generalization in data-scarce, high-IB domains such as underwater and industrial inspection. The approach is supported by theoretical perspectives (Product-of-Experts fusion and a domain-adaptation bound) and demonstrates superior performance over fine-tuned CD-FSOD baselines on UODD and NEU-DET, underscoring the practical value of language as a domain-invariant bridge. This work suggests a broader shift toward prompt-based, training-free adaptation for foundation detectors, with potential extensions to multi-shot, video, and 3D modalities.
Abstract
Foundation object detectors such as GLIP and Grounding DINO excel on general-domain data but often degrade in specialized and data-scarce settings like underwater imagery or industrial defects. Typical cross-domain few-shot approaches rely on fine-tuning on scarce target data, incurring cost and overfitting risks. We instead ask: Can a frozen detector adapt with only one exemplar per class and without any training? To answer this, we introduce training-free one-shot domain generalization for object detection, where detectors must adapt to specialized domains with only one annotated exemplar per class and no weight updates. To tackle this task, we propose LAB-Det, which exploits Language As a domain-invariant Bridge. Instead of adapting visual features, we project each exemplar into descriptive text that conditions and guides a frozen detector. This linguistic conditioning replaces gradient-based adaptation, enabling robust generalization in data-scarce domains. We evaluate on UODD (underwater) and NEU-DET (industrial defects), two widely adopted benchmarks for data-scarce detection where object boundaries are often ambiguous. LAB-Det achieves up to a 5.4 mAP improvement over state-of-the-art fine-tuned baselines without updating a single parameter. These results establish linguistic adaptation as an efficient and interpretable alternative to fine-tuning in specialized detection settings.
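The TL;DR mentions a Product-of-Experts view of fusion alongside a light BLIP-based calibration, but this summary does not give the formula. As a rough illustration only, the sketch below shows a generic weighted Product-of-Experts for two confidence scores treated as Bernoulli experts. The function names, the choice of the two inputs (a detector confidence and a calibration score), and the weights are all assumptions made for illustration, not the paper's actual formulation.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a probability, clamped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def poe_fuse(det_score, cal_score, det_weight=1.0, cal_weight=1.0):
    """Weighted Product-of-Experts for two Bernoulli experts (hypothetical helper).

    The product p_det**w1 * p_cal**w2, renormalized against the matching
    product for the negative outcome, is equivalent to summing weighted
    log-odds and applying a sigmoid.
    """
    z = det_weight * logit(det_score) + cal_weight * logit(cal_score)
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Assumed inputs: a frozen detector's box confidence and a calibration
    # score for the same box; both values here are made up.
    print(poe_fuse(0.8, 0.8))  # two agreeing experts reinforce each other
    print(poe_fuse(0.8, 0.5))  # an uninformative expert (0.5) changes nothing
    print(poe_fuse(0.8, 0.2))  # disagreeing experts pull toward 0.5
```

Under this formulation, a score of 0.5 carries no evidence, so an uncertain calibrator leaves the detector confidence unchanged, while agreement between the two sharpens the fused score. No parameters are updated, which is consistent with the training-free setting described above.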
