Test-Time Visual In-Context Tuning

Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, Bernt Schiele

TL;DR

This work tackles the poor generalization of Visual In-Context Learning (VICL) under distribution shifts by introducing Test-Time Visual In-Context Tuning (VICT). At inference, VICT swaps the roles of the task prompts and the test sample and uses a cycle-consistency loss to reconstruct the original task prompt output, enabling self-supervised adaptation on a single test sample. Across six vision tasks and 15 corruptions, VICT substantially boosts VICL performance in zero-shot and one-shot settings, with encoder-only fine-tuning offering a favorable efficiency-accuracy trade-off. The approach also demonstrates potential for unseen tasks (e.g., colorization) and highlights practical considerations such as inference cost and the benefit of voxel-level, grid-like prompt interfaces. Overall, VICT provides a practical, self-supervised method to improve the robustness of VICL models in real-world, distribution-shifted environments.

Abstract

Visual in-context learning (VICL), as a new paradigm in computer vision, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample. Specifically, we flip the role between the task prompts and the test sample and use a cycle consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on six representative vision tasks ranging from high-level visual understanding to low-level image processing, with 15 common corruptions, demonstrate that our VICT can improve the generalizability of VICL to unseen new domains. In addition, we show the potential of applying VICT for unseen tasks at test time. Code: https://github.com/Jiahao000/VICT.

Paper Structure

This paper contains 14 sections, 4 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Test-time visual in-context tuning (VICT) on six representative vision tasks under distribution shifts. We benchmark the robustness of VICL with 15 common corruptions adopted in [hendrycks2019benchmarking, michaelis2019benchmarking], and report the performance averaged across all corruptions. Existing VICL models like Painter generalize poorly to unseen new domains when the task prompts come from the training distribution (i.e., zero-shot). Performance is even worse when the task prompts come from the test distribution (i.e., one-shot). By performing VICT at test time, we can significantly improve Painter in both the zero-shot and one-shot settings.
  • Figure 2: Overview of our VICT pipeline. Given a pair of task prompts $\left(x,y\right)$ and a test input image $x_t$, we first construct a four-cell grid-like image canvas $I=\left(x,y,x_t,\varnothing\right)$, with an empty cell at the bottom right. We then feed $I$ into the VICL model (e.g., Painter) to predict the test output $\hat{y}_t$. Afterward, we swap the roles of the task prompt pair and the test pair, i.e., we provide the predicted $\hat{y}_t$ as the prompt to the model, creating a new four-cell grid-like image canvas $I^{\prime}=\left(x,\varnothing,x_t,\hat{y}_t\right)$, with an empty cell at the top right. The new $I^{\prime}$ is fed into the same model to predict the task prompt output $\hat{y}$. Finally, we optimize the model by minimizing the distance between $\hat{y}$ and $y$ via a standard regression loss.
  • Figure 3: Comparison with few-shot Painter on six vision tasks with corruptions. We randomly corrupt a certain number of images in the training set, use 1, 2, 4, 8, 16, 32, and 64 shots for training, and evaluate the model on the full corrupted test sets. We report the final results averaged across 15 corruptions. Our zero-shot or one-shot VICT can outperform Painter trained with more shots.
  • Figure 4: Analysis of the trade-off between efficiency and accuracy. We use semantic segmentation on ADE20K-C for the ablation. VICT benefits from more training steps, but at the cost of linearly increasing training time.
  • Figure 5: Visualizations of test examples and predictions for six main tasks with corruptions. We visualize both zero-shot and one-shot settings for Painter and VICT. Zoom in for best view.
  • ...and 1 more figure
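
The cycle described in the Figure 2 caption can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: `predict` is a hypothetical three-parameter toy standing in for a VICL model such as Painter, images are flat lists of pixel values, gradients are taken by finite differences through both forward passes, and the step count and learning rate are arbitrary. It is not the authors' implementation; it only shows the order of operations (predict the test output, swap roles, recover the prompt output, regress onto the true prompt output).

```python
def predict(theta, prompt_in, prompt_out, query_in):
    """Hypothetical stand-in for a VICL model filling the masked canvas cell.

    A real model (e.g. Painter) is a masked-image network; this per-pixel
    affine rule with parameters theta = (a, b, c) is only a toy.
    """
    a, b, c = theta
    return [a * q + b * (po - pi) + c
            for pi, po, q in zip(prompt_in, prompt_out, query_in)]


def cycle_loss(theta, x, y, x_t):
    """Cycle-consistency loss for one test sample x_t and one prompt pair (x, y)."""
    # Canvas I = (x, y, x_t, empty): predict the test output.
    y_t_hat = predict(theta, x, y, x_t)
    # Canvas I' = (x, empty, x_t, y_t_hat): the test pair now acts as the prompt.
    y_hat = predict(theta, x_t, y_t_hat, x)
    # Regression loss (mean squared error) against the true prompt output y.
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


def numerical_grad(theta, x, y, x_t, eps=1e-5):
    """Central finite-difference gradient (assumption: replaces backprop here)."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((cycle_loss(up, x, y, x_t) - cycle_loss(down, x, y, x_t)) / (2 * eps))
    return grad


def tune(theta, x, y, x_t, steps=100, lr=0.05):
    """Plain gradient descent on the cycle loss for a single test sample."""
    theta = list(theta)
    for _ in range(steps):
        grad = numerical_grad(theta, x, y, x_t)
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    x = [0.2, 0.4, 0.6, 0.8]    # prompt input
    y = [0.3, 0.5, 0.7, 0.9]    # prompt output (toy "brighten by 0.1" task)
    x_t = [0.1, 0.5, 0.3, 0.9]  # test input from a shifted distribution
    theta0 = [0.8, 0.8, 0.0]    # imperfect starting parameters
    tuned = tune(theta0, x, y, x_t)
    print("loss before:", cycle_loss(theta0, x, y, x_t))
    print("loss after: ", cycle_loss(tuned, x, y, x_t))
```

Note that no ground-truth test output is used anywhere: the only supervision is the prompt output `y`, which is why the procedure can run on a single unlabeled test sample.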