ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation

Jirayu Burapacheep, Ishan Gaur, Agam Bhatia, Tristan Thrush

TL;DR

ColorSwap presents a Winoground-inspired dataset to probe color–object word-order understanding in multimodal models, constructed via handmade, rule-based, and generative captions paired with diffusion-generated images. Evaluations across image–text matching and visual language models reveal substantial gaps in compositional understanding, with near-random performance on key metrics for many baselines, though minimal fine-tuning on 1,400–2,000 examples yields notable gains for CLIP and BLIP. The work demonstrates both the brittleness of current models to color-based word-order changes and the potential for rapid improvement through targeted fine-tuning, while offering a scalable data-generation pipeline and a public testbed for future work on color comprehension. Overall, ColorSwap provides a practical benchmark and methodological blueprint for improving word-order sensitivity in vision–language systems, with implications for AI-generated art, captioning, and multimodal reasoning.

Abstract

This paper introduces the ColorSwap dataset, designed to assess and improve the proficiency of multimodal models in matching objects with their colors. The dataset comprises 2,000 unique image-caption pairs, grouped into 1,000 examples. Each example includes a caption-image pair, along with a "color-swapped" pair. We follow the Winoground schema: the two captions in an example have the same words, but the color words have been rearranged to modify different objects. The dataset was created through a novel blend of automated caption and image generation with humans in the loop. We evaluate image-text matching (ITM) and visual language models (VLMs) and find that even the latest ones are still not robust at this task. GPT-4V and LLaVA score 72% and 42% on our main VLM metric, although they may improve with more advanced prompting techniques. On the main ITM metric, contrastive models such as CLIP and SigLIP perform close to chance (at 12% and 30%, respectively), although the non-contrastive BLIP ITM model is stronger (87%). We also find that finetuning on fewer than 2,000 examples yields significant performance gains on this out-of-distribution word-order understanding task. The dataset is available at https://github.com/Top34051/colorswap and at https://huggingface.co/datasets/stanfordnlp/colorswap.
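To make the Winoground schema concrete, the sketch below checks that two captions use the same words in a different order and computes Winoground-style text, image, and group scores from per-example similarity scores. It assumes that ColorSwap's ITM metrics follow the Winoground definitions, which the abstract does not spell out. The names `ExampleScores`, `same_words`, and `aggregate`, the example captions, and the numeric scores are all hypothetical and are not taken from the dataset or the authors' code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ExampleScores:
    """Model similarity scores for one example (hypothetical container).

    Field cX_iY holds score(caption X, image Y); caption 0 belongs with
    image 0 and caption 1 belongs with image 1.
    """
    c0_i0: float
    c0_i1: float
    c1_i0: float
    c1_i1: float


def same_words(caption_a: str, caption_b: str) -> bool:
    """True if the captions differ but contain the same multiset of words."""
    words_a = sorted(caption_a.lower().split())
    words_b = sorted(caption_b.lower().split())
    return caption_a != caption_b and words_a == words_b


def text_correct(s: ExampleScores) -> bool:
    # For each image, the matching caption must outscore the swapped caption.
    return s.c0_i0 > s.c1_i0 and s.c1_i1 > s.c0_i1


def image_correct(s: ExampleScores) -> bool:
    # For each caption, the matching image must outscore the swapped image.
    return s.c0_i0 > s.c0_i1 and s.c1_i1 > s.c1_i0


def group_correct(s: ExampleScores) -> bool:
    # An example counts toward the group score only if both checks pass.
    return text_correct(s) and image_correct(s)


def aggregate(examples: Iterable[ExampleScores]) -> Dict[str, float]:
    """Fraction of examples passing each Winoground-style check."""
    items = list(examples)
    if not items:
        raise ValueError("need at least one example")
    n = len(items)
    return {
        "text": sum(text_correct(e) for e in items) / n,
        "image": sum(image_correct(e) for e in items) / n,
        "group": sum(group_correct(e) for e in items) / n,
    }


if __name__ == "__main__":
    # Made-up captions and scores, for illustration only.
    print(same_words("a red dog and a blue cat", "a blue dog and a red cat"))
    demo = [
        ExampleScores(c0_i0=0.9, c0_i1=0.1, c1_i0=0.2, c1_i1=0.8),
        ExampleScores(c0_i0=0.6, c0_i1=0.7, c1_i0=0.5, c1_i1=0.9),
    ]
    print(aggregate(demo))
```

Under these definitions an example contributes to the group score only when the model ranks both captions and both images correctly, so it is the strictest of the three checks.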

Paper Structure (33 sections, 7 figures, 7 tables)

Figures (7)

  • Figure 1: An overview of the ColorSwap dataset creation methodology. The human emoji marks components that require human annotator input.
  • Figure 2: An image from DALL·E 3 [betker2023improving] when given the caption "The key to the shed is blue" (we ensured that the caption was not rewritten by ChatGPT [chatgpt]). DALL·E 3 does not always make this mistake, but it is unreliable. Even though "the shed is blue" is a substring, the full sentence says that the key is blue. Our dataset does not target difficult cases where colors modify distant objects in the string.
  • Figure 3: Illustration of the re-captioning process.
  • Figure 4: Visual language model evaluation prompts. We replace {image} with an image and {caption} with an appropriate caption.
  • Figure 5: Examples 19, 28, and 244 of the ColorSwap dataset. The responses are generated by GPT-4V given different captions and images.
  • ...and 2 more figures