VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors

Atif Belal, Heitor R. Medeiros, Marco Pedersoli, Eric Granger

TL;DR

This work tackles the robustness of vision-language object detectors under domain shift by introducing VLOD-TTA, a test-time adaptation framework that combines IoU-weighted entropy minimization with image-conditioned prompt selection. By exploiting dense, overlapping proposals and updating only lightweight adapters in a single adaptation step, VLOD-TTA improves zero-shot performance across stylized, driving, low-light, and corrupted domains for both YOLO-World and Grounding DINO, without requiring labels. Key contributions include a detection-specific entropy objective that emphasizes spatially coherent proposal clusters and a per-image prompt selection mechanism that fuses the most informative prompts with detector logits. The method shows consistent improvements over baselines on a benchmark spanning stylized datasets, driving scenarios, and COCO-C/PASCAL-C corrupted variants.

Abstract

Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO achieve impressive zero-shot recognition by aligning region proposals with text representations. However, their performance often degrades under domain shift. We introduce VLOD-TTA, a test-time adaptation (TTA) framework for VLODs that leverages dense proposal overlap and image-conditioned prompt scores. First, an IoU-weighted entropy objective is proposed that concentrates adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes. Second, image-conditioned prompt selection is introduced, which ranks prompts by image-level compatibility and fuses the most informative prompts with the detector logits. Our benchmarking across diverse distribution shifts -- including stylized domains, driving scenes, low-light conditions, and common corruptions -- shows the effectiveness of our method on two state-of-the-art VLODs, YOLO-World and Grounding DINO, with consistent improvements over the zero-shot and TTA baselines. Code: https://github.com/imatif17/VLOD-TTA
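
To make the IoU-weighted entropy idea concrete, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not give the exact weighting formula, so the choices here are assumptions: each proposal's weight is its summed IoU with all other proposals, normalized to sum to 1, and each proposal carries a class distribution that sums to 1. The function names `pairwise_iou`, `entropy`, and `iou_weighted_entropy` are hypothetical.

```python
import numpy as np


def pairwise_iou(boxes):
    """IoU matrix for boxes given as rows of (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / np.maximum(union, 1e-12)


def entropy(probs):
    """Shannon entropy of each row of a (num_proposals, num_classes) array."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def iou_weighted_entropy(boxes, probs):
    """Entropy averaged with weights from proposal overlap.

    ASSUMPTION: weight_i is the sum of IoU(i, j) over j != i, normalized.
    The paper may define the weights differently.
    """
    iou = pairwise_iou(boxes)
    np.fill_diagonal(iou, 0.0)          # ignore self-overlap
    support = iou.sum(axis=1)           # high for boxes inside dense clusters
    total = support.sum()
    if total > 0:
        weights = support / total
    else:                               # no overlaps at all: fall back to uniform
        weights = np.full(len(boxes), 1.0 / len(boxes))
    return float((weights * entropy(probs)).sum()), weights


if __name__ == "__main__":
    # Three overlapping proposals plus one isolated box.
    boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11],
                      [0, 1, 10, 11], [50, 50, 60, 60]], dtype=float)
    probs = np.array([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.5, 0.5]])
    loss, weights = iou_weighted_entropy(boxes, probs)
    print("weights:", weights)
    print("IoU-weighted entropy:", loss, "plain mean entropy:", entropy(probs).mean())
```

Under this weighting, the isolated box receives zero weight, so its uncertain prediction does not contribute to the objective, which mirrors the stated goal of reducing confirmation bias from isolated boxes. In an actual adaptation loop the same quantity would be computed in an autodiff framework and minimized with respect to the adapter parameters.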

Paper Structure

This paper contains 29 sections, 8 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Motivation (IWE). Left→right: (i) proposals from two classes—Person (red) (cluster size = 167, max score = 0.14) and Dog (blue) (cluster size = 45, max score = 0.15); (ii) ZS scores fall below the threshold, resulting in a missed detection; (iii) standard entropy minimization overconfidently produces a dog false positive; and (iv) our IoU-weighted entropy minimization focuses updates on dense clusters and suppresses isolated boxes.
  • Figure 2: Motivation (IPS). Left→right: (i) ZS predictions with a correct detection Person (red) and a false positive Dog (blue); (ii) prompt–class score heatmap with circles marking prompts selected by our image-conditioned strategy and right-margin bars showing $S_{\text{PS}}-S_{\text{PA}}$; (iii) prompt averaging (PA) reduces the class score, producing no detections; and (iv) prompt selection (PS) suppresses the dog false positive while preserving the person detection.
  • Figure 3: Overview of our VLOD-TTA. Given an input image and a set of prompts, the text encoder produces embeddings that interact with region proposals via the vision–language head to compute similarity scores. IPS performs top-$\rho$ prompt selection and averages the selected prompts to obtain per-proposal class scores. The method then combines per-proposal entropy with IoU-based weights to form the IWE objective that drives robust TTA.
  • Figure 4: Prompt generation strategies. $\Delta$mAP$_{50}$ over three style-shift datasets measured relative to ZS.
  • Figure 5: Adapters in different detector modules. Mean $\Delta$mAP$_{50}$ averaged over three style-shift datasets, relative to ZS.
  • ...and 7 more figures
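
The top-$\rho$ prompt selection described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes similarity scores are available as an array of shape (proposals, classes, prompts), that $\rho$ is the fraction of prompts kept per class, and that a prompt's image-level compatibility is its mean score over all proposals. All three are assumptions, and the function name `select_prompts` is hypothetical. The fusion with detector logits mentioned in the abstract is not shown.

```python
import math

import numpy as np


def select_prompts(scores, rho=0.5):
    """Keep the top fraction rho of prompts per class and average them.

    scores: array of shape (num_proposals, num_classes, num_prompts).
    ASSUMPTION: image-level compatibility is the mean score over proposals.
    Returns (class_scores, top_idx) with shapes
    (num_proposals, num_classes) and (num_classes, keep).
    """
    _, _, num_prompts = scores.shape
    keep = max(1, math.ceil(rho * num_prompts))
    image_level = scores.mean(axis=0)                       # (classes, prompts)
    top_idx = np.argsort(-image_level, axis=1)[:, :keep]    # best prompts first
    # Gather the selected prompts for every proposal, then average them.
    selected = np.take_along_axis(scores, top_idx[None, :, :], axis=2)
    return selected.mean(axis=2), top_idx


if __name__ == "__main__":
    # Two proposals, one class, four prompts.
    scores = np.array([[[0.1, 0.8, 0.3, 0.6]],
                       [[0.2, 0.7, 0.1, 0.5]]])
    class_scores, top_idx = select_prompts(scores, rho=0.5)
    print("selected prompt indices per class:", top_idx)
    print("prompt selection scores:", class_scores[:, 0])
    print("prompt averaging scores:", scores.mean(axis=2)[:, 0])
```

Averaging only the selected prompts keeps weakly matching prompts from diluting the class score, which is the contrast between prompt averaging (PA) and prompt selection (PS) drawn in the Figure 2 caption.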