Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu

TL;DR

The paper addresses the difficulty vision-language models (VLMs) have with fine-grained visual-linguistic understanding, such as object size, position, existence, and count. It introduces a progressive image-synthesis pipeline that creates controlled image-text pairs differing in a single attribute, and uses it to build SPEC, a benchmark that evaluates the image and text modalities symmetrically. Evaluating four state-of-the-art VLMs, the authors find near-chance performance on SPEC and identify a core limitation in contrastive pretraining losses, which bias models toward noun-like cues. They propose a simple remedy: a hard-negative-aware contrastive loss added to CLIP that preserves zero-shot capabilities and yields substantial improvements on SPEC and on two additional benchmarks, suggesting the method transfers to broader fine-grained multimodal reasoning tasks.

Abstract

Vision language models (VLMs) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs at a finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and a visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guessing, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach. Code and data are available at https://github.com/wjpoom/SPEC.
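The summary above describes the remedy only as a hard-negative-aware contrastive loss added to CLIP. As a rough illustration of that general idea, and not the authors' exact formulation, the sketch below computes an image-to-text InfoNCE loss in which each image's softmax denominator also includes a few hard-negative texts (for example, captions that differ from the true caption in a single attribute). The function names, the `hard_sim` layout, and the temperature value are all assumptions made for this example; a text-to-image term with hard-negative images would be built the same way.

```python
import math
from typing import List


def log_softmax_at(logits: List[float], target: int) -> float:
    """Numerically stable log-softmax of `logits`, evaluated at index `target`."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - log_z


def hard_negative_image_to_text_loss(
    sim: List[List[float]],
    hard_sim: List[List[float]],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Mean image-to-text InfoNCE loss with extra hard-negative texts.

    sim[i][j]      : similarity of image i to in-batch text j (positive: j == i)
    hard_sim[i][k] : similarity of image i to its k-th hard-negative text
    """
    total = 0.0
    for i, row in enumerate(sim):
        # Hard negatives join the in-batch texts in the softmax denominator.
        logits = [s / temperature for s in row + hard_sim[i]]
        total -= log_softmax_at(logits, i)
    return total / len(sim)


# Toy batch of two image-text pairs (hypothetical similarity values).
sim = [[0.9, 0.1], [0.2, 0.8]]
loss_plain = hard_negative_image_to_text_loss(sim, [[], []])
loss_hard = hard_negative_image_to_text_loss(sim, [[0.85], [0.75]])
```

In this toy batch, `loss_hard` is larger than `loss_plain` because the hard negatives score almost as high as the true captions, so the loss keeps penalizing the model until it separates them.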

Paper Structure (18 sections, 9 equations, 5 figures, 4 tables)

This paper contains 18 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: We conduct a symmetrical assessment of VLMs in fine-grained comprehension, considering both linguistic and visual perspectives. The bars show the image-text matching scores for CLIP and for our method. It is evident that CLIP struggles with tasks related to quantity comprehension, whereas our method significantly enhances the model's understanding of fine-grained details.
  • Figure 2: An overview of our progressive data construction pipeline. We begin by generating a batch of images, each containing a single object. We then extract the object from the background of each image. Next, we arrange the background-free objects on a blank canvas according to the specification (with control over attributes). Finally, we carefully fill in the missing background, ensuring consistency across candidates.
  • Figure 3: Ensuring consistency among candidates is crucial to avoid ambiguity. The images above not only differ in quantity but also show a significant variation in the appearance of the objects. Consequently, attributing the model's correctness or errors solely to the understanding of quantity is not convincing.
  • Figure 4: Consistent background inpainting strategy. We first generate an initial background shared by all candidate images. Then, we expand around this region, ensuring consistency in the backgrounds of different images.
  • Figure 5: An overview of the SPEC benchmark. SPEC consists of six distinct subsets, distributed across the dimensions of Size, Position, Existence, and Count. Each test case consists of an image candidate set, whose members differ only in a certain visual concept, and a text candidate set, whose members differ only in the corresponding language concept. Due to space constraints, we present a maximum of three images and texts here; more comprehensive test cases are available in the supplementary material.
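
The test-case structure described in Figure 5 suggests a symmetric scoring scheme: pick the best text for each image, and the best image for each text. The sketch below is a minimal illustration under stated assumptions: the image and text candidate sets have equal size, candidate i in one set matches candidate i in the other, and `sim` holds precomputed image-text matching scores. The function names and the toy scores are hypothetical and are not taken from the SPEC codebase.

```python
from typing import Dict, List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def symmetric_accuracy(sim: List[List[float]]) -> Dict[str, float]:
    """Score one test case from a square similarity matrix.

    sim[i][j] is the matching score between image i and text j; the
    ground-truth pairing is assumed to be the diagonal (image i <-> text i).
    """
    n = len(sim)
    # Image-to-text: for each image (row), pick the best-scoring text.
    i2t = sum(argmax(sim[i]) == i for i in range(n)) / n
    # Text-to-image: for each text (column), pick the best-scoring image.
    t2i = sum(argmax([sim[i][j] for i in range(n)]) == j for j in range(n)) / n
    return {"image_to_text": i2t, "text_to_image": t2i}


# Toy scores for three candidates that differ only in object count.
scores = [
    [0.31, 0.29, 0.27],
    [0.30, 0.28, 0.29],
    [0.26, 0.27, 0.33],
]
result = symmetric_accuracy(scores)
```

With these toy scores, the second candidate is mismatched in both directions, so both accuracies come out to 2/3. With three candidates, random guessing would average 1/3 in each direction.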