Test-Time Adaptive Object Detection with Foundation Model

Yingjie Gao, Yanan Zhang, Zhi Cai, Di Huang

TL;DR

The paper tackles test-time adaptive object detection (TTAOD) without access to source data, enabling open-set cross-domain adaptation by leveraging vision-language foundation models. It introduces a Multi-modal Prompt-based Mean-Teacher framework that jointly tunes text and visual prompts, with a Test-time Warm-start strategy for the visual prompts that preserves the representation capability of the vision branch. An Instance Dynamic Memory (IDM) module stores high-quality pseudo-labels from previous test samples and supports two strategies: Memory Enhancement, which refines the original predictions, and Memory Hallucination, which synthesizes positive samples for images without available pseudo-labels. Results on cross-corruption and cross-dataset benchmarks show consistent gains over previous state-of-the-art methods, indicating practical potential for open-vocabulary, source-free TTAOD.

Abstract

In recent years, test-time adaptive object detection has attracted increasing attention due to its unique advantages in online domain adaptation, which aligns more closely with real-world application scenarios. However, existing approaches rely heavily on source-derived statistical characteristics while making the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data entirely and overcomes traditional closed-set limitations. Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for vision-language detector-driven test-time adaptation, which incorporates text and visual prompt tuning to adapt both language and vision representation spaces on the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored for the visual prompts to effectively preserve the representation capability of the vision branch. Furthermore, to guarantee high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies, Memory Enhancement and Memory Hallucination, to leverage IDM's high-quality instances for enhancing original predictions and hallucinating images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data. Code is available at https://github.com/gaoyingjay/ttaod_foundation.
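To make two of the ideas in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes the mean-teacher update is a standard exponential moving average (EMA) over the tunable prompt parameters, and that the instance memory keeps a bounded number of the highest-scoring pseudo-labels per class. The names `ema_update` and `InstanceMemory`, the `momentum`, `capacity`, and `score_threshold` parameters, and the per-class top-k eviction rule are all hypothetical choices made for illustration.

```python
import heapq
from collections import defaultdict


def ema_update(teacher, student, momentum=0.999):
    """Move teacher parameters toward the student with an EMA step.

    Assumption: parameters are flat lists of floats (e.g. flattened prompts).
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


class InstanceMemory:
    """Hypothetical bounded store of high-confidence pseudo-labels per class."""

    def __init__(self, capacity=3, score_threshold=0.5):
        self.capacity = capacity
        self.score_threshold = score_threshold
        self._store = defaultdict(list)  # label -> min-heap of (score, order, box)
        self._order = 0  # tie-breaker so boxes are never compared directly

    def add(self, label, score, box):
        """Store an instance if it is confident enough; return True if accepted."""
        if score < self.score_threshold:
            return False
        self._order += 1
        item = (score, self._order, box)
        heap = self._store[label]
        if len(heap) < self.capacity:
            heapq.heappush(heap, item)
        else:
            # Memory is full: push the new item and drop the lowest-scoring one.
            heapq.heappushpop(heap, item)
        return True

    def get(self, label):
        """Return stored (score, box) pairs for a class, best first."""
        items = sorted(self._store.get(label, []), reverse=True)
        return [(score, box) for score, _, box in items]
```

In this sketch, the teacher would be refreshed with `ema_update` after each student step, and `InstanceMemory.get` would supply stored instances for whatever enhancement or hallucination step follows; how the paper actually uses the stored instances is not specified here.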

Paper Structure

This paper contains 19 sections, 9 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: (a) Traditional TTAOD methods require source domain statistical characteristics and are limited to closed-set during adaptation. (b) Our method requires no source data while possessing open-vocabulary capability.
  • Figure 2: Overview of our method. It comprises two components: (1) the Multi-modal Prompt-based Mean-Teacher framework shown in (a), incorporating text prompt tuning (green-highlighted) and visual prompt tuning (blue-highlighted) with a Test-time Warm-start strategy; and (2) an Instance Dynamic Memory module that stores high-quality pseudo-labels from previous test samples, integrating with Memory Enhancement (b) and Memory Hallucination (c).
  • Figure 3: Different behaviors of visual prompts. (a) Initialize visual prompts by average-pooling image tokens from the first test sample before TTA. (b) Insert visual prompts with image tokens for every test sample during TTA.
  • Figure 4: Results on the cross-dataset benchmark comprising 13 diverse object detection datasets.
  • Figure 5: Comparison on the Maximum Capacity of IDM.
  • ...and 7 more figures
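
The two prompt behaviors described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: tokens are plain lists of floats, every prompt is initialized to the same global mean of the first sample's image tokens, and prompts are prepended to the token sequence. The function names `average_pool`, `warm_start_prompts`, and `insert_prompts`, as well as the pooling granularity and insertion position, are hypothetical.

```python
def average_pool(tokens):
    """Return the element-wise mean of a list of equal-length token vectors."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / n for d in range(dim)]


def warm_start_prompts(first_sample_tokens, num_prompts):
    """Initialize visual prompts from the first test sample, before adaptation.

    Assumption: each prompt starts as a copy of the globally pooled token.
    """
    mean = average_pool(first_sample_tokens)
    return [list(mean) for _ in range(num_prompts)]


def insert_prompts(prompts, tokens):
    """Combine prompts with image tokens for one test sample.

    Assumption: prompts are prepended; the real insertion position may differ.
    """
    return prompts + tokens
```

The intent of starting from pooled image tokens rather than random values is that the prompts begin inside the distribution the vision branch already handles, which matches the stated goal of preserving its representation capability.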