Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning

Zhuo Huang; Chang Liu; Yinpeng Dong; Hang Su; Shibao Zheng; Tongliang Liu

TL;DR

This work tackles the vulnerability of vision models to distribution shifts by introducing Machine Vision Therapy (MVT), which uses Multimodal Large Language Models (MLLMs) with Denoising In-Context Learning (DICL) to produce corrected supervision for downstream fine-tuning. A noise-transition matrix identifies likely class confusions, and a two-exemplar in-context prompt drives the Diagnosing and Therapy steps that rectify predictions without additional human labeling. The approach combines Transition Matrix Estimation, DICL, and targeted Fine-Tuning. It is supported by theoretical analysis and by extensive experiments on ImageNet variants, WILDS, and DomainBed, which show improved in-distribution (ID) and out-of-distribution (OOD) robustness as well as better performance on fine-grained attributes. The result is a practical, label-efficient way to improve visual recognition under domain shift, with publicly available code for reproducibility.
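To make the role of the noise-transition matrix concrete, here is a minimal sketch. It assumes a small support set with known labels and estimates the matrix as row-normalized confusion counts; the function names and the counting scheme are illustrative assumptions, not the estimator used in the paper.

```python
def estimate_transition_matrix(true_labels, predicted_labels, num_classes):
    """Row-normalized confusion counts: T[i][j] approximates P(predicted=j | true=i).

    Assumption: a small labeled support set is available. This is an
    illustrative estimator, not necessarily the one used in the paper.
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Classes with no support samples get an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def most_probable_noisy_class(matrix, cls):
    """Return the class most often confused with `cls`, excluding `cls` itself."""
    others = [j for j in range(len(matrix)) if j != cls]
    return max(others, key=lambda j: matrix[cls][j])


# Toy support set with three classes (hypothetical data).
true_labels = [0, 0, 0, 1, 1, 2, 2, 2]
predicted_labels = [0, 1, 1, 1, 1, 2, 0, 2]
T = estimate_transition_matrix(true_labels, predicted_labels, num_classes=3)
noisy_for_class_0 = most_probable_noisy_class(T, 0)
```

In this toy data, class 0 is predicted as class 1 in two of its three samples, so class 1 would supply the erroneous exemplar whenever class 0 is involved.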

Abstract

Although vision models such as Contrastive Language-Image Pre-Training (CLIP) show impressive generalization performance, their zero-shot robustness remains limited in Out-of-Distribution (OOD) scenarios without fine-tuning. Rather than resorting to human supervision, as is commonly done, it is possible to take advantage of Multi-modal Large Language Models (MLLMs), which have powerful visual understanding abilities. However, MLLMs have been shown to struggle with vision problems because of task incompatibility, which hinders their use. In this paper, we propose to leverage MLLMs to conduct Machine Vision Therapy, which aims to rectify the noisy predictions of vision models. By fine-tuning on the denoised labels, the performance of the vision model can be boosted in an unsupervised manner. To solve the incompatibility issue, we propose a novel Denoising In-Context Learning (DICL) strategy to align vision tasks with MLLMs. Concretely, by estimating a transition matrix that captures the probability of one class being confused with another, we can construct an instruction containing a correct exemplar and an erroneous one from the most probable noisy class. Such an instruction can help any MLLM with in-context learning (ICL) ability to detect and rectify incorrect predictions of vision models. Through extensive experiments on ImageNet, WILDS, DomainBed, and other OOD datasets, we carefully validate the quantitative and qualitative effectiveness of our method. Our code is available at https://github.com/tmllab/Machine_Vision_Therapy.
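The two-exemplar instruction and the detect-then-rectify loop described above can be sketched as follows. Everything here is a simplified assumption for illustration: the prompt wording, the helper names (`build_instruction`, `accepts`, `rectify`), the yes/no protocol, and the `mllm` callable are hypothetical and are not taken from the paper or its code.

```python
def build_instruction(query_image, positive_image, negative_image, class_name):
    """Two in-context exemplars (one correct, one erroneous) plus the open query."""
    question = f"Is this a photo of a {class_name}?"
    return [
        {"image": positive_image, "text": f"{question} Answer: yes"},
        {"image": negative_image, "text": f"{question} Answer: no"},
        {"image": query_image, "text": f"{question} Answer:"},
    ]


def accepts(mllm, query_image, cls, contrast_cls, exemplars, class_names):
    """Ask the (hypothetical) MLLM whether the query image belongs to `cls`."""
    instruction = build_instruction(
        query_image, exemplars[cls], exemplars[contrast_cls], class_names[cls]
    )
    return mllm(instruction).strip().lower().startswith("yes")


def rectify(mllm, query_image, predicted, noisy_candidates, exemplars, class_names):
    """Diagnose the vision model's prediction, then try the likely noisy classes."""
    # Diagnosing: keep the prediction if the MLLM confirms it.
    if accepts(mllm, query_image, predicted, noisy_candidates[0], exemplars, class_names):
        return predicted
    # Therapy: take the first candidate class that the MLLM confirms.
    for candidate in noisy_candidates:
        if accepts(mllm, query_image, candidate, predicted, exemplars, class_names):
            return candidate
    return predicted  # No candidate confirmed; leave the label unchanged.


def fake_mllm(instruction):
    """Stand-in for a real MLLM: answers "yes" only when asked about a wolf."""
    return "yes" if "wolf" in instruction[-1]["text"] else "no"


class_names = ["husky", "wolf"]
exemplars = {0: "husky.jpg", 1: "wolf.jpg"}  # hypothetical exemplar image paths
corrected = rectify(fake_mllm, "query.jpg", 0, [1], exemplars, class_names)
```

Here the stand-in model rejects the "husky" prediction and confirms "wolf", so `corrected` is class 1. A real system would replace `fake_mllm` with an actual multimodal model and would likely compare answer scores instead of parsing a yes/no string.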

Paper Structure

This paper contains 51 sections, 6 theorems, 22 equations, 17 figures, 14 tables, 1 algorithm.

Introduction
Methodology
Problem Formulation and Overview
Transition Matrix Estimation
Denoising In-Context Learning
Diagnosing.
Therapy.
Fine-Tuning of Vision Models
Theoretical Analysis
Experiments
Experimental Setup
Datasets.
Models and baselines.
Settings.
Quantitative Comparison
...and 36 more sections

Key Result

Theorem 2.2

Suppose the above assumptions hold. If, for all $\phi\in\Phi$ with $\phi\neq\phi^*$, the concept $\phi^*$ satisfies the distinguishability condition $\sum_{j=1}^k KL_j(\phi^*\|\phi) > \epsilon_{start}^{\phi} + \epsilon_{delim}^{\phi}$, then as $n\rightarrow\infty$, the prediction according to the pr… Thus, the in-context predictor $f_n$ achieves the optimal 0-1 risk: $\lim_{n\rightarrow\infty}$…

Figures (17)

Figure 1: Illustration of our methodology. Upper row: Comparison between the common fine-tuning process and fine-tuning via Machine Vision Therapy. Our method potentially eliminates the need for human annotation by leveraging the knowledge of MLLMs. Lower row: Comparison between the previous MLLM solution to vision tasks and the Denoising In-Context Learning strategy. Instead of considering all classes, our method makes predictions by presenting a pair of positive and negative exemplars.
Figure 2: Workflow of our Machine Vision Therapy: The orange part demonstrates the Transition Matrix Estimation, the blue part indicates the Denoising In-Context Learning process, and the green part illustrates the Fine-Tuning of vision models.
Figure 3: Ablation study on transition matrix estimation by comparing our method with random sampling and ground truth.
Figure 4: Ablation study on detection score distribution. Upper: ImageNet-A; Lower: ImageNet-V.
Figure 5: Performance analysis by varying the number of top-$N$ chosen noisy classes.
...and 12 more figures

Theorems & Definitions (7)

Theorem 2.2
Lemma 2.3
Theorem 2.4
Theorem 2.7
Lemma 2.8
Theorem 2.9
proof
