How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation

Zhongyi Han; Guanglin Zhou; Rundong He; Jindong Wang; Tailin Wu; Yilong Yin; Salman Khan; Lina Yao; Tongliang Liu; Kun Zhang

TL;DR

This study examines how well GPT-4V adapts to distribution shifts across natural, medical, and molecular domains, comparing it with CLIP, LLaVA, and Gemini through zero-shot, perturbation, and in-context learning evaluations. Using 13 diverse datasets and a structured VQA-style prompt design, the authors quantify GPT-4V's zero-shot generalization, its robustness to engineered perturbations (Gaussian noise and ControlNet-driven style changes), and the potential of in-context learning to bridge domain gaps without parameter updates. Key findings show that GPT-4V is relatively robust to natural distribution shifts and perturbations, often outperforming the baselines on challenging cases, and that in-context learning can yield meaningful gains in target-domain performance, although the size of the gain varies by dataset. The results highlight both the strengths and the limitations of current multimodal foundation models, underscoring the need for domain-specific fine-tuning and careful prompt and context design to maximize reliability in high-stakes settings such as medicine and chemistry. The work provides a publicly available benchmark and prompts a broader discussion on the role of foundation models in robust, real-world AI deployment.

Abstract

In machine learning, generalization under distribution shifts, where deployment conditions diverge from the training scenarios, is crucial, particularly in fields like climate modeling, biomedicine, and autonomous driving. The emergence of foundation models, distinguished by their extensive pretraining and task versatility, has led to increased interest in their adaptability to distribution shifts. GPT-4V(ision) is the most advanced publicly accessible multimodal foundation model, with extensive applications across various domains, including anomaly detection, video understanding, image generation, and medical diagnosis. However, its robustness to distribution shifts remains largely underexplored. Addressing this gap, this study rigorously evaluates GPT-4V's adaptability and generalization capabilities in dynamic environments, benchmarking against prominent models like CLIP, LLaVA, and Gemini. We examine GPT-4V's zero-shot generalization across 13 diverse datasets spanning natural, medical, and molecular domains. We further investigate its adaptability to controlled data perturbations and examine the efficacy of in-context learning as a tool to enhance its adaptation. Our findings delineate GPT-4V's capability boundaries under distribution shifts, shedding light on its strengths and limitations across various scenarios. Importantly, this investigation contributes to our understanding of how AI foundation models generalize to distribution shifts, offering pivotal insights into their adaptability and robustness. The code is publicly available at https://github.com/jameszhou-gl/gpt-4v-distribution-shift.
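As a rough illustration of the "controlled data perturbations" mentioned above, the sketch below adds zero-mean Gaussian noise to an image and clips the result to the valid range. It is not taken from the authors' repository: the function name `add_gaussian_noise`, the nested-list image representation, the [0, 1] pixel scale, and the `sigma` value are all assumptions chosen to keep the example self-contained.

```python
import random


def add_gaussian_noise(pixels, sigma, seed=None):
    """Return a noisy copy of `pixels` (rows of floats in [0, 1]).

    Hypothetical helper: adds zero-mean Gaussian noise with standard
    deviation `sigma` to every pixel, then clips back into [0, 1].
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    noisy = []
    for row in pixels:
        noisy.append(
            [min(1.0, max(0.0, value + rng.gauss(0.0, sigma))) for value in row]
        )
    return noisy


# Tiny 2x3 grayscale "image" used only for illustration.
image = [[0.0, 0.5, 1.0], [0.25, 0.75, 0.5]]
noisy = add_gaussian_noise(image, sigma=0.1, seed=0)
```

Increasing `sigma` gives a stronger perturbation, which is one simple way to sweep shift severity; the noise levels actually used in the study are not stated in this section.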

Paper Structure (35 sections, 41 figures, 5 tables)

Introduction
Motivation and Overview
Our Approach in Exploring GPT-4V
How Do We Treat Distribution Shifts in This Work?
Sample Selection Guidance for GPT-4V Evaluation
Prompt Designs
Contributions of This Report
Limitations of This Report
Observations
Zero-shot Generalization Across Varied Domains
Natural Images
Task Introduction
Comparative Accuracies Across Datasets and Domains
Case Demonstration
Medical Images
...and 20 more sections

Figures (41)

Figure 1: Comparative analysis of zero-shot generalization performance across 13 distinct datasets, encompassing natural, medical, and molecular domains. The analysis features the performances of four advanced models: CLIP, LLaVA, GPT-4V, and Gemini.
Figure 2: An illustration of a structured prompt format used in the PACS dataset, showcasing a specific approach for image-based questioning and response formatting. The format includes a question about the image's content, a list of answer choices, and a template for answering, including an answer, confidence score, and the reasoning process.
Figure 3: An illustration of a structured prompt format used in the PACS dataset, showcasing a specific approach for image-based questioning and response formatting. The format includes a question about the image's content, a list of answer choices, and a template for answering, including an answer, confidence score, and the reasoning process.
Figure 4: Improvements in target domain performance with in-context learning on GPT-4V across Camelyon17, COVID, DrugOOD_Assay and NIH_Chest datasets.
Figure 5: Demonstration of GPT-4V's inference process when exposed to in-context learning with examples from the Camelyon17 dataset. The experiment involves using two representative images from the source domain (hospital_2), one labeled 'normal' and the other 'tumor', followed by a test image from the target domain (hospital_3). GPT-4V, conditioned with these in-context examples, distinguishes between regular and uniform tissue patterns in the 'normal' image and abnormal, irregular cell sizes in the 'tumor' image. It then applies this contextual understanding to accurately infer the class of the test image from hospital_3. This process showcases GPT-4V's ability to leverage in-context cues for effective domain bridging.
...and 36 more figures
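
The structured prompt described in the Figure 2 caption (a question, a list of answer choices, and an answer template with answer, confidence score, and reasoning) and the in-context setup described in the Figure 5 caption (labeled source-domain examples followed by a target-domain test image) can be sketched as a small text builder. This is only an illustration: the function name `build_prompt`, the exact wording of each line, and the example descriptions are assumptions, and attaching the images themselves is omitted.

```python
def build_prompt(question, choices, examples=None):
    """Assemble a VQA-style prompt as plain text.

    Hypothetical helper: `examples` is an optional list of
    (description, label) pairs standing in for in-context
    source-domain images; the real images are not attached here.
    """
    lines = []
    for index, (description, label) in enumerate(examples or [], start=1):
        lines.append(f"Example {index} ({description}): the correct answer is {label}.")
    lines.append(f"Question: {question}")
    lines.append("Choices: " + ", ".join(choices))
    # Response template with the three fields named in the caption.
    lines.append("Please respond in the following format:")
    lines.append("Answer Choice: [one of the choices]")
    lines.append("Confidence Score: [a number between 0 and 1]")
    lines.append("Reasoning: [a brief explanation]")
    return "\n".join(lines)


# Zero-shot prompt for a PACS-style classification question.
pacs_classes = ["dog", "elephant", "giraffe", "guitar", "horse", "house", "person"]
zero_shot = build_prompt("What is in this image?", pacs_classes)

# In-context prompt mirroring the Camelyon17 setup: two source-domain
# examples (hospital_2), then a question about a target-domain image.
in_context = build_prompt(
    "What is the class of this tissue image?",
    ["normal", "tumor"],
    examples=[("hospital_2 image", "normal"), ("hospital_2 image", "tumor")],
)
```

Asking for a fixed response template makes the answer, confidence score, and reasoning easy to parse from the model's reply, which is presumably why the prompt specifies one.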
