On the Limitations of Vision-Language Models in Understanding Image Transforms

Ahmad Mustafa Anis, Hasnain Ali, Saquib Sarfraz

TL;DR

This paper tests whether Vision-Language Models (VLMs) understand simple image transformations. The authors build an augmented version of the Flickr8k dataset and evaluate CLIP and SigLIP on three tasks: linking transformation descriptions to images, matching augmented images to augmented prompts, and classifying the transformation type. They find that current VLMs show limited explicit understanding of image transformations: performance varies by transformation category and by task, and transformation classification is particularly weak. This gap constrains downstream tasks such as image editing, which depend on a reliable understanding of spatial manipulations. The work motivates training paradigms that balance invariance with explicit transformation awareness, with the aim of more robust visual reasoning in multimodal systems.
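
The third task, classifying the transformation type, can be pictured as zero-shot matching between one image embedding and a set of prompt embeddings. The sketch below is only an illustration under stated assumptions: the 3-dimensional vectors are toy stand-ins for real CLIP or SigLIP embeddings, the prompt wording is invented, and `cosine` and `classify_transform` are hypothetical helper names, not the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_transform(image_embedding, prompt_embeddings):
    """Return the label whose prompt embedding is most similar to the image."""
    return max(
        prompt_embeddings,
        key=lambda label: cosine(image_embedding, prompt_embeddings[label]),
    )


# Toy vectors standing in for model embeddings (assumption, illustrative only).
prompt_embeddings = {
    "rotated by 90 degrees": [1.0, 0.0, 0.0],
    "brightness increased": [0.0, 1.0, 0.0],
    "gaussian blur applied": [0.0, 0.0, 1.0],
}
image_embedding = [0.1, 0.9, 0.2]

predicted = classify_transform(image_embedding, prompt_embeddings)
print(predicted)
```

In a real evaluation the embeddings would come from the model's image and text encoders, and accuracy would be the fraction of images whose predicted label matches the transformation actually applied.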

Abstract

Vision Language Models (VLMs) have demonstrated significant potential in various downstream tasks, including Image/Video Generation, Visual Question Answering, Multimodal Chatbots, and Video Understanding. However, these models often struggle with basic image transformations. This paper investigates the image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by Google. Our findings reveal that these models lack comprehension of multiple image-level augmentations. To facilitate this study, we created an augmented version of the Flickr8k dataset, pairing each image with a detailed description of the applied transformation. We further explore how this deficiency impacts downstream tasks, particularly in image editing, and evaluate the performance of state-of-the-art Image2Image models on simple transformations.
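
As a rough picture of what pairing each image with a description of the applied transformation could look like, the sketch below applies a transform to a tiny grid of pixel values and records a matching caption. Everything here is an assumption made for illustration: the helper names, the three example transforms, the brightness factor, and the caption format are hypothetical and are not taken from the paper's pipeline.

```python
def rotate_90_clockwise(pixels):
    """Rotate a 2-D grid of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*pixels[::-1])]


def horizontal_flip(pixels):
    """Mirror each row left to right."""
    return [row[::-1] for row in pixels]


def adjust_brightness(pixels, factor):
    """Scale every value by `factor` and clamp to the 0..255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in pixels]


# Hypothetical registry: augmentation name -> (function, text description).
AUGMENTATIONS = {
    "rotate_90": (rotate_90_clockwise, "rotated 90 degrees clockwise"),
    "hflip": (horizontal_flip, "flipped horizontally"),
    "brighten": (lambda px: adjust_brightness(px, 1.5), "brightness increased by 50%"),
}


def make_pair(pixels, caption, name):
    """Apply one augmentation and attach a description of it to the caption."""
    transform, description = AUGMENTATIONS[name]
    return {
        "image": transform(pixels),
        "caption": caption,
        "augmented_caption": f"{caption}, {description}",
        "label": name,
    }


sample = make_pair([[1, 2], [3, 4]], "a dog runs on grass", "rotate_90")
print(sample["augmented_caption"])
```

A real pipeline would operate on image files with an image library rather than nested lists; the nested-list form is used here only to keep the example dependency-free.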

Paper Structure

This paper contains 24 sections, 10 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Comparison of image augmentation understanding between humans and Vision Language Models (CLIP/SigLIP). While humans can recognize and describe image transformations like rotation, brightness adjustment, and contrast changes, Vision Language Models show significant limitations in comprehending these basic image manipulations.
  • Figure 2: Distribution of individual augmentations applied to the Flickr8k dataset. The augmentations span multiple transformation types, including geometric (rotations, flips), color adjustments (brightness, contrast, saturation), clarity modifications (blur, sharpness), and various image processing effects.
  • Figure 3: Distribution of augmentations applied to the dataset. The augmentations are grouped into six primary categories: Geometric (rotations and flips), Color (brightness, contrast, saturation, and hue adjustments), Clarity (blur and sharpness), Distortion (perspective and affine transformations), Size (cropping and stretching), and Processing (noise, solarization, posterization, and other effects).
  • Figure 4: Accuracy comparison of model performance on augmented prompt recognition. Higher values indicate better understanding of the relationship between textual descriptions of transformations and their visual manifestations.
  • Figure 5: Comparison of model performance on augmentations grouped according to their properties.
  • ...and 4 more figures
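
The six-category grouping in the Figure 3 caption, and the grouped comparison in Figure 5, can be sketched as a lookup table plus a per-category accuracy roll-up. This is a minimal sketch under assumptions: the category names follow the Figure 3 caption, but the augmentation keys, the `accuracy_by_category` helper, and the example outcomes are hypothetical and are not results from the paper.

```python
from collections import defaultdict

# Category names follow the Figure 3 caption; the keys are illustrative.
CATEGORY_OF = {
    "rotation": "Geometric",
    "flip": "Geometric",
    "brightness": "Color",
    "contrast": "Color",
    "saturation": "Color",
    "hue": "Color",
    "blur": "Clarity",
    "sharpness": "Clarity",
    "perspective": "Distortion",
    "affine": "Distortion",
    "crop": "Size",
    "stretch": "Size",
    "noise": "Processing",
    "solarize": "Processing",
    "posterize": "Processing",
}


def accuracy_by_category(results):
    """Aggregate (augmentation_name, is_correct) pairs into per-category accuracy."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for name, is_correct in results:
        category = CATEGORY_OF[name]
        totals[category] += 1
        hits[category] += int(is_correct)
    return {category: hits[category] / totals[category] for category in totals}


# Made-up outcomes, used only to show the aggregation step.
example_results = [("rotation", True), ("flip", False), ("blur", True)]
summary = accuracy_by_category(example_results)
print(summary)
```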