
LAB-Det: Language as a Domain-Invariant Bridge for Training-Free One-Shot Domain Generalization in Object Detection

Xu Zhang, Zhe Chen, Jing Zhang, Dacheng Tao

TL;DR

LAB-Det introduces training-free one-shot domain generalization for object detection: it translates one exemplar per class into descriptive language and conditions a frozen detector on the resulting language prompts. By combining exemplar-derived phrases with a language-conditioned detection pipeline and a lightweight, optional BLIP-based calibration, it generalizes across domains in data-scarce settings with ambiguous object boundaries (high IB), such as underwater imagery and industrial inspection. The approach is supported by two theoretical perspectives (Product-of-Experts fusion and a domain-adaptation bound) and outperforms fine-tuned cross-domain few-shot object detection (CD-FSOD) baselines on UODD and NEU-DET by up to 5.4 mAP, underscoring the practical value of language as a domain-invariant bridge. The work suggests a broader shift toward prompt-based, training-free adaptation for foundation detectors, with potential extensions to multi-shot, video, and 3D settings.

Abstract

Foundation object detectors such as GLIP and Grounding DINO excel on general-domain data but often degrade in specialized and data-scarce settings like underwater imagery or industrial defects. Typical cross-domain few-shot approaches rely on fine-tuning on scarce target data, incurring cost and overfitting risks. We instead ask: Can a frozen detector adapt with only one exemplar per class without training? To answer this, we introduce training-free one-shot domain generalization for object detection, where detectors must adapt to specialized domains with only one annotated exemplar per class and no weight updates. To tackle this task, we propose LAB-Det, which exploits Language As a domain-invariant Bridge. Instead of adapting visual features, we project each exemplar into a descriptive text that conditions and guides a frozen detector. This linguistic conditioning replaces gradient-based adaptation, enabling robust generalization in data-scarce domains. We evaluate on UODD (underwater) and NEU-DET (industrial defects), two widely adopted benchmarks for data-scarce detection where object boundaries are often ambiguous. LAB-Det achieves up to 5.4 mAP improvement over state-of-the-art fine-tuned baselines without updating a single parameter. These results establish linguistic adaptation as an efficient and interpretable alternative to fine-tuning in specialized detection settings.

Paper Structure (27 sections, 7 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Challenge and conceptual comparison of training-free 1-shot domain generalization for object detection. (a) In specialized domains (e.g., underwater or industrial), data can be very scarce (e.g., only one labeled exemplar per class) and object boundaries are ambiguous, so general-domain detectors fail under severe domain gaps. (b) Training-based approaches rely on gradient-based fine-tuning, which incurs extra cost, overfits under 1-shot supervision, and remains unstable in high-IB domains (those with ambiguous object boundaries). In contrast, LAB-Det leverages language as a domain-invariant bridge: support exemplars are re-expressed as descriptive text and injected into frozen vision–language models, enabling training-free, interpretable, and robust target-domain detection.
  • Figure 2: Overview of LAB-Det. Top: exemplar-to-language projection. A single annotated exemplar is segmented by SAM and described by DAM under a domain-aware prompt, yielding natural-language phrases (e.g., “rough texture”). Bottom: these phrases condition a frozen detector (e.g., Grounding DINO) to generate candidate boxes and phrase scores. Category scores are obtained by averaging phrase scores, and an optional BLIP-based calibration refines small or ambiguous detections. The entire pipeline is training-free and interpretable.
  • Figure 3: Qualitative comparison on UODD and NEU-DET datasets. From left to right: predictions from LAB-Det, ground-truth annotations, and Grounding DINO baseline, respectively.
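
The Figure 2 caption states that category scores are obtained by averaging phrase scores. A minimal sketch of that aggregation step is given below, assuming the frozen detector returns one score per phrase for each candidate box. The function name `category_scores_per_box`, the phrase-to-class mapping, the class names, and all numbers are hypothetical and chosen only for illustration; only the phrase "rough texture" comes from the caption. This is not the authors' implementation, and it omits the optional BLIP-based calibration.

```python
from statistics import fmean


def category_scores_per_box(phrase_scores, phrase_to_category):
    """Average phrase scores into one score per category for each box.

    phrase_scores: list with one dict per candidate box, mapping
        phrase -> score (assumed output format of a frozen detector).
    phrase_to_category: dict mapping each phrase to its class name.
    """
    per_box = []
    for box_scores in phrase_scores:
        # Collect the scores of all phrases that describe the same class.
        grouped = {}
        for phrase, score in box_scores.items():
            grouped.setdefault(phrase_to_category[phrase], []).append(score)
        # The category score is the plain mean of its phrase scores.
        per_box.append({cat: fmean(vals) for cat, vals in grouped.items()})
    return per_box


# Hypothetical phrases, classes, and scores, for illustration only.
phrase_to_category = {
    "rough texture": "patches",
    "dark irregular region": "patches",
    "thin bright line": "scratches",
}
phrase_scores = [
    {"rough texture": 0.6, "dark irregular region": 0.4, "thin bright line": 0.2},
    {"rough texture": 0.1, "dark irregular region": 0.3, "thin bright line": 0.7},
]

scores = category_scores_per_box(phrase_scores, phrase_to_category)
# Assign each box the class with the highest averaged score.
labels = [max(s, key=s.get) for s in scores]
print(scores)
print(labels)
```

With these made-up inputs, the first box is assigned "patches" and the second "scratches". Averaging keeps any single phrase from dominating a class, which is one plausible reason to describe each exemplar with several phrases.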