On the Limitations of Vision-Language Models in Understanding Image Transforms
Ahmad Mustafa Anis, Hasnain Ali, Saquib Sarfraz
TL;DR
This paper probes whether Vision-Language Models (VLMs) understand simple image transforms. The authors build an augmented version of the Flickr8k dataset and evaluate CLIP and SigLIP on three tasks: linking transformation descriptions to images, matching augmented images to augmented prompts, and classifying the transformation type. They find that current VLMs have limited explicit understanding of image transformations: performance varies by transformation category and by task, and transformation classification is particularly weak. This gap constrains downstream tasks such as image editing, which depend on a reliable understanding of spatial manipulations. The work motivates new training paradigms that balance invariance with explicit transformation awareness, with the aim of more robust visual reasoning in multimodal systems.
Abstract
Vision-Language Models (VLMs) have demonstrated significant potential in various downstream tasks, including image and video generation, visual question answering, multimodal chatbots, and video understanding. However, these models often struggle with basic image transformations. This paper investigates the image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by Google. Our findings reveal that these models fail to comprehend multiple image-level augmentations. To facilitate this study, we created an augmented version of the Flickr8k dataset, pairing each image with a detailed description of the applied transformation. We further explore how this deficiency affects downstream tasks, particularly image editing, and evaluate the performance of state-of-the-art Image2Image models on simple transformations.
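To make the dataset construction concrete, the following is a minimal sketch of pairing an image with a description of the applied transformation. It is not the authors' pipeline: the transform names, the description phrases, the `augment` helper, and the record fields are all hypothetical, and images are represented as plain nested lists so the example stays free of dependencies.

```python
# Illustrative only: a toy version of "pair each image with a description
# of the applied transformation". All names below are assumptions.

def hflip(img):
    # Mirror each row left-to-right.
    return [row[::-1] for row in img]

def vflip(img):
    # Reverse the order of the rows (top-to-bottom mirror).
    return img[::-1]

def rotate90_cw(img):
    # Rotate 90 degrees clockwise: reverse the rows, then transpose.
    return [list(col) for col in zip(*img[::-1])]

# Hypothetical registry: transform name -> (function, description phrase).
TRANSFORMS = {
    "horizontal_flip": (hflip, "flipped horizontally"),
    "vertical_flip": (vflip, "flipped vertically"),
    "rotate_90": (rotate90_cw, "rotated 90 degrees clockwise"),
}

def augment(img, caption, name):
    """Apply one named transform and return an image/description record."""
    fn, phrase = TRANSFORMS[name]
    return {
        "image": fn(img),
        "caption": caption,                         # original caption
        "transform": name,                          # label for classification
        "augmented_caption": f"{caption}, {phrase}",  # transformation-aware prompt
    }

# A 2x3 "image" of pixel values standing in for a real photo.
image = [[1, 2, 3],
         [4, 5, 6]]
sample = augment(image, "a dog running on grass", "rotate_90")
```

A record of this shape would support all three evaluation tasks named in the TL;DR: the `augmented_caption` serves as the transformation-aware prompt, and the `transform` field serves as the classification label.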
