Describing Differences in Image Sets with Natural Language

Lisa Dunlap; Yuhui Zhang; Xiaohan Wang; Ruiqi Zhong; Trevor Darrell; Jacob Steinhardt; Joseph E. Gonzalez; Serena Yeung-Levy

TL;DR

This paper defines Set Difference Captioning (SDC): given two image sets $\mathcal{D}_A$ and $\mathcal{D}_B$, the goal is to describe differences that are more characteristic of $\mathcal{D}_A$ than of $\mathcal{D}_B$. It introduces VisDiff, a two-stage proposer-ranker system that first generates candidate differences from small image subsets using captioning and large language models, then ranks them with a CLIP-based signal across the full sets. A new benchmark, VisDiffBench, contains 187 paired image sets, each with a ground-truth difference description, and enables systematic evaluation and ablations across proposers, rankers, and noise conditions. Across multiple applications, from dataset and model comparisons to memorability analyses, VisDiff uncovers meaningful, previously unknown differences, which highlights its utility as an automated, interpretable tool for auditing data and analyzing model behaviors. The approach relies on vision-language foundation models (BLIP-2, LLaVA, GPT-4, CLIP) and emphasizes human-in-the-loop interpretation and robustness considerations for practical deployment.

Abstract

How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two sets of images, which we term Set Difference Captioning. This task takes in image sets $D_A$ and $D_B$, and outputs a description that is more often true on $D_A$ than $D_B$. We outline a two-stage approach that first proposes candidate difference descriptions from image sets and then re-ranks the candidates by checking how well they can differentiate the two sets. We introduce VisDiff, which first captions the images and prompts a language model to propose candidate descriptions, then re-ranks these descriptions using CLIP. To evaluate VisDiff, we collect VisDiffBench, a dataset with 187 paired image sets with ground-truth difference descriptions. We apply VisDiff to various domains, such as comparing datasets (e.g., ImageNet vs. ImageNetV2), comparing classification models (e.g., zero-shot CLIP vs. supervised ResNet), summarizing model failure modes (supervised ResNet), characterizing differences between generative models (e.g., StableDiffusionV1 and V2), and discovering what makes images memorable. Using VisDiff, we are able to find interesting and previously unknown differences in datasets and models, demonstrating its utility in revealing nuanced insights.
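
The re-ranking step described above can be illustrated with a small sketch. It assumes that each candidate description already has one image-text similarity score per image in each set (for example from CLIP), and it orders candidates by AUROC, the metric the Figure 3 caption reports for the ranker. The function names `auroc` and `rank_candidates` and all score values are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, List, Tuple


def auroc(pos: List[float], neg: List[float]) -> float:
    """Probability that a random score from `pos` exceeds one from `neg`.

    Ties count as half a win. A value near 1.0 means the description
    separates set A (pos) from set B (neg) well; 0.5 is chance level.
    """
    if not pos or not neg:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_candidates(
    scores_a: Dict[str, List[float]],
    scores_b: Dict[str, List[float]],
) -> List[Tuple[str, float]]:
    """Sort candidate descriptions by how well they separate A from B."""
    ranked = [(desc, auroc(scores_a[desc], scores_b[desc])) for desc in scores_a]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical similarity scores (assumption: precomputed elsewhere).
scores_a = {
    "people practicing yoga": [0.31, 0.29, 0.33],
    "people indoors": [0.22, 0.25, 0.21],
}
scores_b = {
    "people practicing yoga": [0.18, 0.20, 0.30],
    "people indoors": [0.23, 0.24, 0.22],
}

ranking = rank_candidates(scores_a, scores_b)
for description, score in ranking:
    print(f"{score:.3f}  {description}")
```

The pairwise loop is quadratic in the number of images, which is fine for illustration; a rank-based AUROC would be the usual choice for full-size sets.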

Paper Structure (70 sections, 16 figures, 11 tables)

Introduction
Related Works
Set Difference Captioning
Task Definition
Benchmark
Evaluation
Our Method: VisDiff
Proposer
Ranker
Results
Which Proposer is Best?
Which Ranker is Best?
Can Algorithm Find True Difference?
Performance Under Noisy Data Splits
Other ablations of VisDiff algorithm.
...and 55 more sections

Figures (16)

Figure 1: Set difference captioning. Given two sets of images $\mathcal{D}_A$ and $\mathcal{D}_B$, output natural language descriptions of concepts which are more true for $\mathcal{D}_A$. In this example, $\mathcal{D}_A$ and $\mathcal{D}_B$ are images from the "Dining Table" class in ImageNetV2 and ImageNet, respectively.
Figure 2: VisDiff algorithm. VisDiff consists of a GPT-4 proposer on BLIP-2 generated captions and a CLIP ranker. The proposer takes randomly sampled image captions from $\mathcal{D}_A$ and $\mathcal{D}_B$ and proposes candidate differences. The ranker takes these proposed differences and evaluates them across all the images in $\mathcal{D}_A$ and $\mathcal{D}_B$ to assess which ones are most true.
Figure 3: Top 5 descriptions generated by the caption-based, image-based, and feature-based proposer. All the top 5 descriptions from the caption-based proposer and the top 2 from the image-based proposer identify the ground-truth difference between "practicing yoga" and "meditating", while feature-based fails. We report AUROC scores from the same feature-based ranker described in the Ranker section.
Figure 4: VisDiff performance under noise. We randomly swap different percentages of images between $\mathcal{D}_A$ and $\mathcal{D}_B$ to inject noise. Results are computed on 50 paired sets in PairedImageSets-Hard. 95% confidence intervals are reported over three runs.
Figure 5: StableDiffusionV2 vs. V1 generated images. For the same prompt, StableDiffusionV2 images often contain more "vibrant contrasting colors" and "artworks placed on stands or in frames". Randomly sampled images can be found in the supplementary material.
...and 11 more figures
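
The noise injection mentioned in the Figure 4 caption, where a percentage of images is swapped between $\mathcal{D}_A$ and $\mathcal{D}_B$, could look roughly like the sketch below. The paper's exact sampling protocol is not given here, so the helper `swap_fraction`, its rounding rule, and the seeded sampling are assumptions made for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def swap_fraction(
    set_a: Sequence[str],
    set_b: Sequence[str],
    fraction: float,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Swap roughly `fraction` of the items between two sets.

    Assumption: the number of swapped pairs is fraction * min(|A|, |B|),
    rounded to the nearest integer, and pairs are chosen uniformly at
    random. Set sizes are preserved.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)
    k = int(round(fraction * min(len(set_a), len(set_b))))
    idx_a = rng.sample(range(len(set_a)), k)
    idx_b = rng.sample(range(len(set_b)), k)
    noisy_a, noisy_b = list(set_a), list(set_b)
    for i, j in zip(idx_a, idx_b):
        # Exchange one item from A with one item from B.
        noisy_a[i], noisy_b[j] = noisy_b[j], noisy_a[i]
    return noisy_a, noisy_b


# Hypothetical image identifiers standing in for real image paths.
images_a = [f"a{i}" for i in range(10)]
images_b = [f"b{i}" for i in range(10)]

noisy_a, noisy_b = swap_fraction(images_a, images_b, fraction=0.2, seed=42)
print(noisy_a)
print(noisy_b)
```

Running the ranker on `noisy_a` and `noisy_b` at increasing fractions gives one way to trace how separation degrades as the two sets are mixed.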
