Adapting Large Multimodal Models to Distribution Shifts: The Role of In-Context Learning

Guanglin Zhou, Zhongyi Han, Shiming Chen, Biwei Huang, Liming Zhu, Salman Khan, Xin Gao, Lina Yao

TL;DR

This work investigates in-context learning (ICL) as an effective alternative for enhancing LMMs' adaptability and proposes InvariantSelectPR, a novel method leveraging Class-conditioned Contrastive Invariance (CCI) for more robust demonstration selection.

Abstract

Recent studies indicate that large multimodal models (LMMs) potentially act as general-purpose assistants and are highly robust against different distributions. Despite this, domain-specific adaptation is still necessary, particularly in specialized areas like healthcare. Because fine-tuning LMMs is impractical given their vast parameter space, this work investigates in-context learning (ICL) as an effective alternative for enhancing LMMs' adaptability. We first evaluate an unsupervised ICL method that selects in-context examples through a nearest-example search based on feature similarity. We find that its effectiveness is limited by the deficiencies of pre-trained vision encoders under distribution shift. To address these challenges, we propose InvariantSelectPR, a novel method leveraging Class-conditioned Contrastive Invariance (CCI) for more robust demonstration selection. Specifically, CCI enhances pre-trained vision encoders by improving their discriminative capabilities across different classes and ensuring invariance to domain-specific variations. This enhancement allows the encoders to identify and retrieve the most informative examples, which are then used to guide LMMs in adapting to new query samples under varying distributions. Our experiments show that InvariantSelectPR substantially improves the adaptability of LMMs, achieving significant performance gains on benchmark datasets, with a 34.2% accuracy increase in the 7-shot setting on Camelyon17 and a 16.9% increase in the 7-shot setting on HAM10000 compared to the baseline zero-shot performance.
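
The nearest-example search described in the abstract can be illustrated with a minimal sketch. It assumes that image features have already been extracted by some vision encoder and that similarity is measured by cosine similarity; the function name `top_k_nearest`, the toy feature vectors, and the choice of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def top_k_nearest(query_feat, pool_feats, k=7):
    """Return indices and similarities of the k pool examples closest to the query.

    Assumption: features are plain vectors and closeness is cosine similarity.
    """
    # L2-normalize so that a dot product equals cosine similarity.
    q = query_feat / np.linalg.norm(query_feat)
    p = pool_feats / np.linalg.norm(pool_feats, axis=1, keepdims=True)
    sims = p @ q
    # Sort by descending similarity; a stable sort keeps ties deterministic.
    order = np.argsort(-sims, kind="stable")[:k]
    return order, sims[order]

# Hypothetical 2-D features standing in for encoder outputs.
pool = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])

idx, sims = top_k_nearest(query, pool, k=2)
print(idx, sims)
```

The retrieved indices would then select the demonstrations placed in the prompt ahead of the query. The quality of this selection depends entirely on the feature space, which is why a weak encoder under distribution shift limits the approach.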

Paper Structure (28 sections, 5 equations, 9 figures, 7 tables)

This paper contains 28 sections, 5 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Comparative illustration of (a) zero-shot transfer, which relies on LMMs' pre-trained knowledge to respond to queries, potentially leaving a large distribution gap between the pre-training data and the query samples, and (b) in-context learning (ICL), which introduces an example from a distribution closer to that of the query sample to bridge this gap. This work investigates different retrieval methods for selecting effective ICL examples.
  • Figure 2: A proxy task to evaluate potential distribution shifts in LMMs, illustrating zero-shot performance across various datasets compared to random guessing. Red horizontal lines indicate the average performance of LMMs on each dataset; their small deviations from random guessing indicate significant distribution shifts, particularly in medical contexts such as the Camelyon17, HAM10000, NIH-Chest, and COVID datasets.
  • Figure 3: ICL Demonstrations under Distribution Shifts: (a) Performance comparison between Zero-shot and RandomPR, illustrating the limitations of random in-context example selection across four datasets, where one-shot RandomPR often underperforms compared to zero-shot. (b) Analysis of 77 query samples from the target domain, hospital_3 in Camelyon17, using 50 distinct one-shot examples to examine performance variability. Mean values are marked in blue, and variance is represented by black lines, highlighting the significant impact of example selection on model accuracy. If appropriate in-context samples are chosen, there is a potential for gains up to 40.25%.
  • Figure 4: Overview of three retrieval methods: RandomPR, TopKNearestPR, and InvariantSelectPR. RandomPR selects examples without specific criteria, often overlooking informative ones. TopKNearestPR uses feature similarities for selection, yet struggles with domain-specific tasks where pre-trained encoder features lack sufficient detail. In contrast, InvariantSelectPR uses a class-conditioned contrastive invariance (CCI) framework to enhance vision encoders, effectively identifying the most representative samples by focusing on key invariant features.
  • Figure 5: Basic prompt template in all ICL experiments.
  • ...and 4 more figures
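
The Figure 4 caption describes CCI as making encoder features discriminative across classes while invariant to domain. The exact CCI objective is not given in this excerpt, so the sketch below substitutes a generic supervised contrastive loss in which positives are all same-class samples regardless of the domain they come from. The function name `supcon_loss`, the temperature value, and the toy features are assumptions made for illustration only.

```python
import numpy as np

def supcon_loss(feats, labels, temperature=0.1):
    """Generic supervised contrastive loss (NOT the paper's exact CCI objective).

    Same-class samples are treated as positives whatever their domain, so
    minimizing the loss pulls classes together across domains.
    """
    labels = np.asarray(labels)
    z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    self_mask = np.eye(len(labels), dtype=bool)
    # Exclude each anchor's similarity with itself from the softmax.
    logits = np.where(self_mask, -np.inf, logits)
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Positives: same label, different sample.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0  # skip anchors that have no positive
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float((-pos_log_prob[valid] / pos_count[valid]).mean())

# Hypothetical features: rows 0-1 are close together, rows 2-3 are close together.
feats = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])

loss_aligned = supcon_loss(feats, [0, 0, 1, 1])  # classes match the clusters
loss_mixed = supcon_loss(feats, [0, 1, 0, 1])    # classes cut across the clusters
print(loss_aligned, loss_mixed)
```

In this toy setup the loss is lower when same-class samples already sit close together, which is the property a retrieval encoder needs for a nearest-example search to return class-relevant demonstrations.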