
Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

Youngtaek Oh, Pyunghwan Ahn, Jinhyung Kim, Gwangmo Song, Soonyoung Lee, In So Kweon, Junmo Kim

TL;DR

The intricate relationship between compositionality and recognition -- two pivotal aspects of VLM capability -- is explored, covering both pre-training approaches aimed at recognition and the fine-tuning methods designed to improve compositionality.

Abstract

Vision and language models (VLMs) such as CLIP have showcased remarkable zero-shot recognition abilities yet face challenges in visio-linguistic compositionality, particularly in linguistic comprehension and fine-grained image-text alignment. This paper explores the intricate relationship between compositionality and recognition -- two pivotal aspects of VLM capability. We conduct a comprehensive evaluation of existing VLMs, covering both pre-training approaches aimed at recognition and the fine-tuning methods designed to improve compositionality. Our evaluation employs 12 benchmarks for compositionality, along with 21 zero-shot classification and two retrieval benchmarks for recognition. In our analysis from 274 CLIP model checkpoints, we reveal patterns and trade-offs that emerge between compositional understanding and recognition accuracy. Ultimately, this necessitates strategic efforts towards developing models that improve both capabilities, as well as the meticulous formulation of benchmarks for compositionality. We open our evaluation framework at https://github.com/ytaek-oh/vl_compo.

Paper Structure (11 sections, 8 figures, 5 tables)

Figures (8)

  • Figure 1: A comprehensive overview of the trend between compositionality and recognition. Pre-trained VLMs (in blue points) exhibit improved compositionality alongside enhanced zero-shot classification. Conversely, models fine-tuned for compositionality (in green and orange points) demonstrate trade-offs between these two capabilities. A detailed analysis is provided in the experiments section.
  • Figure 2: Nuanced trade-offs between compositionality and recognition in VLMs. While enhanced pre-training methods can lead to improvements in both areas (e.g., SLIP and DeCLIP), this is not always the case. The section on pre-trained models provides a detailed explanation.
  • Figure 3: Data scaling property in compositionality tasks. (Left) Compositionality improves along with the scale of pre-training data. (Right) Pre-training with data that include real images tends to yield better efficiency in compositionality against recognition.
  • Figure 4: Exploring fine-tuning effects on compositionality and recognition through the lens of weight-space ensembling (WiSE-FT; Wortsman et al., 2022). (Left) Increased compositionality comes at the cost of zero-shot classification accuracy. (Center) For retrieval tasks, models fine-tuned on COCO (e.g., NegCLIP and CE-CLIP) enhance I2T recall in the initial stages, whereas fine-tuning on datasets less akin to COCO, such as CC3M (e.g., TSVLC and DAC), can result in noticeable drops. (Right) Fine-tuning benefits T2I recall consistently across all models.
  • Figure 5: A comparative analysis of pre-trained and fine-tuned CLIP models in both compositionality and recognition at the benchmark level. Pre-trained models generally show a positive correlation between compositionality and zero-shot classification, with the exception of the WhatsUp benchmark. In contrast, fine-tuned models exhibit mixed properties across different benchmarks.
  • ...and 3 more figures
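
The weight-space ensembling mentioned in the Figure 4 caption can be sketched as a linear interpolation between a pre-trained and a fine-tuned checkpoint, with a mixing coefficient alpha in [0, 1]. The snippet below is a minimal illustration only, not the authors' evaluation code: the function name `interpolate_weights`, the toy parameter names, and the use of plain Python lists in place of real CLIP tensors are all assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def interpolate_weights(pretrained: StateDict, finetuned: StateDict, alpha: float) -> StateDict:
    """Return (1 - alpha) * pretrained + alpha * finetuned, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both checkpoints must share the same parameter names")

    merged: StateDict = {}
    for name, w0 in pretrained.items():
        w1 = finetuned[name]
        if len(w0) != len(w1):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # alpha = 0 recovers the pre-trained weights, alpha = 1 the fine-tuned ones.
        merged[name] = [(1.0 - alpha) * a + alpha * b for a, b in zip(w0, w1)]
    return merged


# Toy checkpoints (illustrative values only, not real model weights).
pretrained = {"proj.weight": [0.0, 2.0], "proj.bias": [1.0]}
finetuned = {"proj.weight": [1.0, 4.0], "proj.bias": [3.0]}

# Sweep alpha to obtain a family of models between the two endpoints.
sweep = {alpha: interpolate_weights(pretrained, finetuned, alpha) for alpha in (0.0, 0.5, 1.0)}
```

Evaluating each interpolated model in such a sweep on both compositionality and recognition benchmarks would trace a curve between the two endpoints, which is the kind of view the Figure 4 caption describes.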