VladVA: Discriminative Fine-tuning of LVLMs
Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Brais Martinez, Georgios Tzimiropoulos
TL;DR
VladVA bridges the gap between discriminative VLMs and LVLMs by converting a generative LVLM into a discriminative model. Its hybrid training framework operates on image-text data of variable length, using short captions for contrastive alignment and long captions for autoregressive learning. Adaptation is parameter-efficient, combining soft prompts with LoRA adapters, which enables effective fine-tuning with modest data. On zero-shot retrieval and compositionality benchmarks, VladVA surpasses state-of-the-art CLIP-like models of similar size and achieves notable gains on complex reasoning tasks, including Winoground and SugarCrepe. The work demonstrates that LVLMs can deliver strong discriminative performance while preserving generative capabilities, with practical benefits for retrieval and multi-turn visual reasoning, and it suggests that the approach could scale with larger data and more diverse architectures.
Abstract
Contrastively trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding and often exhibit "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) a carefully designed training/optimization framework that uses image-text pairs of variable length and granularity to train the model with both contrastive and next-token prediction losses, accompanied by ablation studies that justify the necessity of the framework's components; (2) a parameter-efficient adaptation method that combines soft prompting with LoRA adapters; and (3) significant improvements over state-of-the-art CLIP-like models of similar size on standard image-text retrieval benchmarks, along with notable gains in compositionality.
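To make the idea of training with both contrastive and next-token prediction losses concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a symmetric InfoNCE-style contrastive term over a batch of image and text embeddings and a standard cross-entropy term over caption tokens. The function names, the temperature of 0.07, and the `ntp_weight` balancing factor are illustrative assumptions; the abstract does not specify them.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th image and the i-th text are the positive pair.
    The temperature value is an assumption, not taken from the paper."""
    imgs = [l2_normalize(e) for e in image_embs]
    txts = [l2_normalize(e) for e in text_embs]
    n = len(imgs)
    # Cosine similarity matrix, scaled by the temperature.
    sims = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(sims[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    cols = [[sims[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def next_token_loss(logits, targets):
    """Mean cross-entropy; logits[t] scores the vocabulary for token targets[t]."""
    return -sum(log_softmax(l)[y] for l, y in zip(logits, targets)) / len(targets)

def hybrid_loss(image_embs, text_embs, logits, targets, ntp_weight=1.0):
    """Contrastive term plus a weighted next-token term (weight is hypothetical)."""
    return (contrastive_loss(image_embs, text_embs)
            + ntp_weight * next_token_loss(logits, targets))
```

In this sketch the contrastive term would be computed from pooled image and caption embeddings, while the next-token term would be computed from the per-token logits the LVLM produces for a caption. How the two terms are weighted and which captions feed each term are design choices that the full paper, rather than this sketch, determines.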
