VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors

Atif Belal; Heitor R. Medeiros; Marco Pedersoli; Eric Granger

TL;DR

This work tackles the robustness of vision-language object detectors under domain shift by introducing VLOD-TTA, a test-time adaptation framework that combines IoU-weighted entropy minimization with image-conditioned prompt selection. By exploiting dense, overlapping proposals and updating only lightweight adapters in a single adaptation step, VLOD-TTA improves zero-shot performance across stylized, driving, low-light, and corrupted domains for both YOLO-World and Grounding DINO, without requiring labels. Key contributions include a detection-specific entropy objective that emphasizes spatially coherent proposal clusters and a per-image prompt selection mechanism that fuses the most informative prompts with detector logits. The method shows consistent improvements over baselines on a benchmark spanning stylized datasets, driving scenarios, and the corrupted COCO-C and PASCAL-C variants, which makes it relevant for robust open-vocabulary detection in real-world settings.

Abstract

Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO achieve impressive zero-shot recognition by aligning region proposals with text representations. However, their performance often degrades under domain shift. We introduce VLOD-TTA, a test-time adaptation (TTA) framework for VLODs that leverages dense proposal overlap and image-conditioned prompt scores. First, we propose an IoU-weighted entropy objective that concentrates adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes. Second, we introduce image-conditioned prompt selection, which ranks prompts by image-level compatibility and fuses the most informative prompts with the detector logits. Our benchmarking across diverse distribution shifts (stylized domains, driving scenes, low-light conditions, and common corruptions) shows the effectiveness of our method on two state-of-the-art VLODs, YOLO-World and Grounding DINO, with consistent improvements over the zero-shot and TTA baselines. Code: https://github.com/imatif17/VLOD-TTA
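
To make the idea of an IoU-weighted entropy objective concrete, the following is a minimal pure-Python sketch, not the authors' implementation. The abstract does not give the exact formula, so several choices here are assumptions made for illustration: each proposal is weighted by its summed IoU with all other proposals, the weights are normalized to sum to one, and class probabilities come from a softmax over the proposal logits. The function names (`iou`, `iou_weighted_entropy`, and so on) are hypothetical.

```python
import math


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def iou_weighted_entropy(boxes, logits):
    """Weighted sum of per-proposal entropies.

    ASSUMPTION (illustrative only): the weight of a proposal is its summed
    IoU with every other proposal, normalized over all proposals, so that
    proposals inside an overlapping cluster dominate and isolated boxes
    contribute little or nothing.
    """
    n = len(boxes)
    raw = [
        sum(iou(boxes[i], boxes[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    total = sum(raw)
    # Fall back to uniform weights when no proposals overlap at all.
    weights = [r / total for r in raw] if total > 0 else [1.0 / n] * n
    entropies = [entropy(softmax(row)) for row in logits]
    loss = sum(w * h for w, h in zip(weights, entropies))
    return loss, weights


# Two overlapping proposals and one isolated, highly uncertain proposal.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
logits = [[2.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
loss, weights = iou_weighted_entropy(boxes, logits)
```

In this toy input the isolated box overlaps nothing and therefore receives zero weight, so its high entropy does not drive the objective. In an actual test-time adaptation loop this scalar would be computed with a differentiable framework and minimized with respect to the adapter parameters; that part is omitted here.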
