VladVA: Discriminative Fine-tuning of LVLMs

Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Brais Martinez, Georgios Tzimiropoulos

TL;DR

VladVA addresses the gap between discriminative VLMs and LVLMs by converting a generative LVLM into a discriminative model through a hybrid training framework that leverages short captions for contrastive alignment and long captions for autoregressive learning on variable-length image-text data. It employs parameter-efficient adaptation with soft prompts and LoRA adapters, enabling effective fine-tuning with modest data. Empirical results on zero-shot retrieval and compositionality benchmarks show VladVA surpassing state-of-the-art CLIP-like models of similar size and achieving notable gains on complex reasoning tasks, including Winoground and SugarCrepe. The work demonstrates that LVLMs can deliver strong discriminative performance while preserving generative capabilities, offering practical benefits for retrieval and multi-turn visual reasoning, and suggesting scalable avenues with larger data and diverse architectures.

Abstract

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) a carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components; (2) a parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters; (3) significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.
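The joint objective described in contribution (1) can be illustrated with a minimal NumPy sketch. It is not the paper's exact formulation: it assumes a standard symmetric InfoNCE loss for the contrastive term, plain token-level cross-entropy for the next-token term, and a simple weighted sum of the two. The function names, the temperature of 0.07, and the `ar_weight` parameter are illustrative assumptions.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of matched (image, short caption) pairs.
    # img_emb, txt_emb: arrays of shape (B, D); row i of each forms a positive pair.
    # The temperature value is an assumption, not taken from the paper.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(logits))
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

def next_token_loss(token_logits, target_ids):
    # Mean cross-entropy for next-token prediction on a long caption.
    # token_logits: (T, V) logits per position; target_ids: (T,) gold token ids.
    log_probs = log_softmax(token_logits, axis=-1)
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

def hybrid_loss(img_emb, txt_emb, token_logits, target_ids, ar_weight=1.0):
    # Weighted sum of the two terms; the weighting scheme is an assumption.
    return contrastive_loss(img_emb, txt_emb) + ar_weight * next_token_loss(
        token_logits, target_ids
    )
```

In this sketch the contrastive term would be fed embeddings from short-caption pairs and the next-token term logits from long-caption pairs, mirroring the split described in the abstract.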

Paper Structure

This paper contains 24 sections, 2 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Overall VladVA framework: a generative LVLM is adapted into a discriminative model using (1) a contrastive training loss (Sec. \ref{ssec:method-c}) and (2) an autoregressive loss (Sec. \ref{ssec:method-ar}). The contrastive loss is applied to image-text pairs with shorter captions and encourages the last token produced for each modality to be discriminative. The autoregressive loss, jointly optimized with the contrastive loss, is applied only to longer captions and allows the model to learn fine-grained details.
  • Figure 2: Entropy of the output probability distribution at the next-to-be-predicted token position, computed with LLaVA-1.5-7B over a set of 50 prompts, for both images and captions.
  • Figure 3: Cumulative variance of the image and text embedding matrices over a set of 50 prompts on Flickr30k. Embeddings that capture more information about the input translate into a cumulative variance that requires more principal components to be explained, i.e. a higher-rank embedding matrix.
  • Figure 4: Top-k next-to-be-predicted tokens before and after VladVA fine-tuning (our approach). On the right, we show the output probability distribution for each case. With the best prompt ("Summarize the provided image in one word"), the next-token representations can encode more diverse and more discriminative information, potentially yielding better-quality embeddings. This behavior improves further after VladVA fine-tuning.
  • Figure 5: Image and text retrieval scores on Flickr30k over a set of 50 image-text prompts, ordered by their entropy scores (Fig. 2). Higher average entropy correlates positively with zero-shot retrieval performance.
  • ...and 2 more figures
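
The two diagnostics named in the captions of Figures 2 and 3 can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the captions only, not the authors' analysis code: the function names, the 0.95 variance threshold, and the use of an SVD on mean-centered embeddings are assumptions.

```python
import numpy as np

def next_token_entropy(logits):
    # Shannon entropy (in nats) of the softmax distribution over the vocabulary
    # at the next-to-be-predicted token position. logits: 1-D array of shape (V,).
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

def cumulative_explained_variance(embeddings):
    # Cumulative fraction of variance explained by the principal components
    # of an embedding matrix of shape (N, D), one row per input.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return np.cumsum(variance) / variance.sum()

def components_needed(embeddings, threshold=0.95):
    # Number of principal components required to reach the variance threshold.
    # A larger count suggests a higher-rank, more informative embedding matrix.
    # The 0.95 threshold is an assumption chosen for illustration.
    cumulative = cumulative_explained_variance(embeddings)
    return int(np.searchsorted(cumulative, threshold) + 1)
```

Under these assumptions, a prompt whose next-token distribution has higher entropy, or whose embeddings need more components to reach the threshold, would be the kind the captions associate with more informative representations.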