VisMin: Visual Minimal-Change Understanding

Rabiul Awal, Saba Ahmadi, Le Zhang, Aishwarya Agrawal

TL;DR

VisMin introduces the Visual Minimal-Change Understanding benchmark, which probes fine-grained visual-language understanding through changes to a single aspect at a time: object, attribute, count, or spatial relation. An automated pipeline combines LLM-guided edits with diffusion-based image editing and is followed by a four-step human verification, yielding a large training set (64,392 samples) and a challenging benchmark (2,084 samples) in COCO-like scenes. Benchmark results show that current VLMs struggle with spatial reasoning and counting. Fine-tuning on the minimal-change data substantially improves CLIP and, to a lesser extent, Idefics2, and also improves CLIP's general image-text alignment. The data-generation framework is scalable, and the gains from VisMin data carry over to multiple fine-grained understanding benchmarks. The benchmark, training data, and model checkpoints are released.

Abstract

Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. The image pair and caption pair contain minimal changes, i.e., only one aspect changes at a time from among the following: object, attribute, count, and spatial relation. These changes test the models' understanding of objects, attributes (such as color, material, shape), counts, and spatial relationships between objects. We built an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and in counting. We also generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks and in CLIP's general image-text alignment. We release all resources, including the benchmark, training data, and finetuned model checkpoints, at https://vismin.net/.
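
To make the matching task concrete, here is a minimal sketch of how one minimal-change example could be scored. It assumes the model exposes a similarity score for every image-caption pair, as CLIP-style models do. The function name `match_scores`, the 2x2 `sim` layout, and the three reported fields are illustrative assumptions; this section does not spell out the authors' exact metric.

```python
from typing import Dict, Sequence


def match_scores(sim: Sequence[Sequence[float]]) -> Dict[str, bool]:
    """Score one minimal-change example from a 2x2 similarity matrix.

    Assumption: sim[i][j] is the model's similarity between image i and
    caption j, and the correct matches are (image 0, caption 0) and
    (image 1, caption 1).
    """
    # One image, two captions: each image must prefer its own caption.
    text_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # One caption, two images: each caption must prefer its own image.
    image_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # The example counts as fully solved only if both directions are right.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


# Hypothetical similarity values, for illustration only.
print(match_scores([[0.9, 0.2], [0.3, 0.8]]))
print(match_scores([[0.9, 0.5], [0.6, 0.55]]))
```

In the second call, image 1 scores higher with caption 0 (0.6) than with its own caption (0.55), so the caption-selection check fails even though both captions still pick the right image.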

Paper Structure

This paper contains 24 sections, 15 figures, and 7 tables.

Figures (15)

  • Figure 1: Overview of our VisMin benchmark. VisMin consists of four types of minimal changes (object, attribute, count, and spatial relation) between two image-caption pairs. The evaluation task requires a model to predict the correct image-caption match given either (1) two images and one caption or (2) two captions and one image.
  • Figure 2: Our dataset creation pipeline includes three stages: (i) Minimal-Change Pairs Synthesis: We develop methods for synthesizing minimal-change image-caption pairs involving Objects & Attributes and Counting & Spatial Relations. (ii) Automatic Filtering: An LLM generates questions and answers based on the captions, and a VQA model predicts answers from the images. Synthetically generated minimal-change data are excluded if the answers don't match. (iii) Human Verification: Synthetically generated minimal-change data undergo a rigorous 4-step human verification, and only examples passing all steps are included in the benchmark.
  • Figure 3: VisMin categories and subcategories.
  • Figure 4: Fine-tuning on VisMin yields greater improvements for larger models. The circle radius reflects the number of model parameters.
  • Figure 5: (Left) Recall results with ViT-L/14 on the COCO benchmark. (Right) Standard benchmark results for Idefics2.
  • ...and 10 more figures
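
The automatic filtering stage in the Figure 2 caption can be sketched as follows. The callables `generate_qa` (standing in for the LLM) and `answer_question` (standing in for the VQA model) are hypothetical placeholders, not the authors' interfaces, and the case-insensitive string comparison is an assumption about how answers are matched.

```python
from typing import Any, Callable, Iterable, Tuple

QAPair = Tuple[str, str]  # (question, answer expected from the caption)


def keep_sample(
    image: Any,
    caption: str,
    generate_qa: Callable[[str], Iterable[QAPair]],
    answer_question: Callable[[Any, str], str],
) -> bool:
    """Return True if every caption-derived answer matches the VQA answer."""
    for question, expected in generate_qa(caption):
        predicted = answer_question(image, question)
        # Assumed matching rule: exact match after trimming and lowercasing.
        if predicted.strip().lower() != expected.strip().lower():
            return False  # a single mismatch excludes the sample
    return True


# Toy stand-ins for the LLM and the VQA model (hypothetical).
def toy_generate_qa(caption: str) -> Iterable[QAPair]:
    return [("How many dogs are there?", "two")] if "two dogs" in caption else []


def toy_answer_question(image: Any, question: str) -> str:
    return image["vqa_answer"]  # here an "image" is just a dict with a canned answer


print(keep_sample({"vqa_answer": "Two"}, "two dogs on a couch",
                  toy_generate_qa, toy_answer_question))    # True
print(keep_sample({"vqa_answer": "three"}, "two dogs on a couch",
                  toy_generate_qa, toy_answer_question))    # False
```

Note that in this sketch a caption that yields no questions is kept by default; a real pipeline would likely need an explicit policy for that case.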