
Retrieving Counterfactuals Improves Visual In-Context Learning

Guangzhi Xiong, Sanchit Sinha, Zhenghao He, Aidong Zhang

Abstract

Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.
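
The contrast the abstract draws between passive similarity-based retrieval and counterfactual-style retrieval can be illustrated with a toy sketch. This is not the CIRCLES implementation (see the linked repository for that). Everything here is an assumption made for illustration: the 3-dimensional embeddings, the pool item names, the helper names `cosine` and `top_k`, and the choice to use the embedding of a counterfactual caption directly as the composed query.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, pool, k):
    """Return the names of the k pool items most similar to the query."""
    ranked = sorted(pool, key=lambda name: cosine(query, pool[name]), reverse=True)
    return ranked[:k]

# Hypothetical embeddings of training images; each axis stands for one
# visual attribute (attribute 1, attribute 2, attribute 3).
pool = {
    "train_a": [0.9, 0.8, 0.1],
    "train_b": [0.8, 0.9, 0.2],
    "train_c": [0.9, 0.1, 0.1],  # like the query, but attribute 2 is absent
    "train_d": [0.1, 0.8, 0.9],
}

# Hypothetical embedding of the query image (attributes 1 and 2 present).
query_image = [1.0, 1.0, 0.0]

# Stand-in for the embedding of a counterfactual caption that describes the
# query with attribute 2 removed (assumed form of the composed query).
counterfactual_query = [1.0, 0.0, 0.0]

# Similarity-based retrieval: nearest neighbours of the query image.
r_corr = top_k(query_image, pool, k=2)

# Counterfactual-style retrieval: nearest neighbour of the edited description.
r_causal = top_k(counterfactual_query, pool, k=1)

# Demonstration set: similar examples plus the counterfactual-style example.
demonstrations = r_corr + [name for name in r_causal if name not in r_corr]
print(r_corr, r_causal, demonstrations)
```

In this toy pool, plain similarity search returns only near-duplicates of the query, while the counterfactual query surfaces the one example that differs in a single attribute, which is the kind of demonstration the abstract argues is more informative.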

Paper Structure

This paper contains 42 sections, 11 equations, 12 figures, and 11 tables.

Figures (12)

  • Figure 1: Illustration of how composed image retrieval provides additional causal understanding.
  • Figure 2: Overview of the CIRCLES framework. Given a query image $I_q$ and question $Q_q$, the top branch illustrates correlational understanding via standard image retrieval, while the bottom branch depicts causal understanding using attribute-guided composed image retrieval with counterfactual captions from the VLM $\Phi$. Blue/orange rounded rectangles represent image/text embeddings. $\mathcal{R}_{\text{corr}}$ and $\mathcal{R}_{\text{causal}}$ denote the retrieved in-context examples for answer prediction.
  • Figure 3: Qualitative comparison of in-context examples retrieved by RICES and CIRCLES for a CUB test image (Magnolia Warbler). Top: standard image retrieval (IR) neighbors used by RICES, leading to incorrect predictions. Bottom: counterfactual examples from composed image retrieval (CIR) in CIRCLES, highlighting key attribute changes and guiding the model to the correct label.
  • Figure 4: Performance comparison between CIRCLES and RICES on the CUB dataset under varying levels of information scarcity in the training set.
  • Figure 5: Comparison of CIRCLES performance using CIR implemented by CIReVL and OSrCIR on the CUB dataset with Gemma3 (4B/12B) backbones.
  • ...and 7 more figures