CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning

Hao Yu, Zhuokai Zhao, Shen Yan, Lukasz Korycki, Jianyu Wang, Baosheng He, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hanchao Yu

TL;DR

CAFe addresses the conflict between representation learning and generative capability in LVLMs by jointly optimizing a contrastive embedding objective and autoregressive language modeling on a pretrained MLLM. It uses a prompt-based embedding instruction and a joint loss $L = \alpha_{lm} L_{lm} + \alpha_{con} L_{con}$ to produce high-quality multimodal embeddings while preserving fluent generation. Across zero-shot retrieval, multimodal embedding benchmarks, and object-hallucination tests, CAFe achieves state-of-the-art results and reduces object hallucinations, while also closing the modality gap between image and text embeddings in representation space. This unified framework offers a practical path toward LVLMs that excel in both precise retrieval and coherent generation.
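
The joint loss above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the symmetric InfoNCE form of the contrastive term, the temperature of 0.07, the default weights of 1.0, and all function names are assumptions made for illustration. Only the weighted sum of a language-modeling loss and a contrastive loss comes from the text.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where pair i is (img_embs[i], txt_embs[i]).

    Assumption: the source only says "contrastive loss"; the symmetric form
    and the temperature value are illustrative choices.
    """
    img = [l2_normalize(v) for v in img_embs]
    txt = [l2_normalize(v) for v in txt_embs]
    n = len(img)
    sims = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(sims[i])[i] for i in range(n)) / n
    # Text-to-image: same, over the columns of the similarity matrix.
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def lm_loss(token_logits, target_ids):
    """Mean next-token cross-entropy (the autoregressive term)."""
    total = sum(log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids))
    return -total / len(target_ids)


def joint_loss(l_lm, l_con, alpha_lm=1.0, alpha_con=1.0):
    """L = alpha_lm * L_lm + alpha_con * L_con; the default weights are placeholders."""
    return alpha_lm * l_lm + alpha_con * l_con


if __name__ == "__main__":
    # Toy batch of two matched pairs and two target tokens over a 3-word vocabulary.
    images = [[1.0, 0.1], [0.2, 1.0]]
    texts = [[0.9, 0.0], [0.0, 1.1]]
    logits = [[2.0, 0.5, 0.1], [0.3, 0.2, 1.5]]
    targets = [0, 2]
    print(joint_loss(lm_loss(logits, targets), contrastive_loss(images, texts)))
```

Keeping both terms in one objective means every gradient step sees the generation signal as well as the embedding signal, which is the mechanism the summary credits for preserving generative ability during embedding finetuning.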

Abstract

The rapid advancement of large vision-language models (LVLMs) has driven significant progress in multimodal tasks, enabling models to interpret, reason, and generate outputs across both visual and textual domains. While excelling in generative tasks, existing LVLMs often face limitations in tasks requiring high-fidelity representation learning, such as generating image or text embeddings for retrieval. Recent work has proposed fine-tuning LVLMs for representation learning, but the fine-tuned model often loses its generative capabilities due to the representation-learning training paradigm. To address this trade-off, we introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. By integrating a contrastive objective with autoregressive language modeling, our approach unifies these traditionally separate tasks, achieving state-of-the-art results in both multimodal retrieval and multimodal generative benchmarks, including object hallucination (OH) mitigation. CAFe establishes a novel framework that synergizes embedding and generative functionalities in a single model, setting a foundation for future multimodal models that excel in both retrieval precision and coherent output generation.

Paper Structure

This paper contains 23 sections, 3 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Capabilities of various vision-language models. While encoder-based models, e.g., CLIP, excel in generating vision-text aligned embeddings and show promising results in image-text retrieval, they fall short in producing free-form text and reasoning about retrieved images (left). Conversely, Multimodal Large Language Models (MLLMs) have shown remarkable success in multimodal understanding and generation, but their direct embeddings yield suboptimal retrieval results (middle). CAFe effectively bridges this gap by integrating representation learning and language generation, enabling not only retrieval but also advanced generative capabilities (right).
  • Figure 2: Pipeline of the proposed framework, CAFe. It leverages a pretrained MLLM to jointly encode multimodal input and generate language responses. The model is trained using a weighted combination of contrastive loss and autoregressive language modeling loss on paired multimodal input. Specialized embedding instructions are designed to prompt the MLLM to generate effective embeddings, while language instructions are employed for language generation. The image is from the MSCOCO dataset [Chen2015MicrosoftCC].
  • Figure 3: Scatter plots of image-text embeddings from the MSCOCO [Chen2015MicrosoftCC] and Flickr30K [flickrentitiesijcv] datasets. CAFe removes the existing modality gap in multimodal representation space.
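
The modality gap mentioned in Figure 3 can be quantified in several ways. The sketch below uses one common choice, the Euclidean distance between the centroids of L2-normalized image and text embeddings. This metric and the function names are assumptions for illustration; the text here does not state how the paper measures the gap.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def modality_gap(img_embs, txt_embs):
    """Distance between the image centroid and the text centroid.

    Assumption: centroid distance on unit-normalized embeddings is used
    here as a stand-in; other gap measures exist.
    """
    c_img = centroid([l2_normalize(v) for v in img_embs])
    c_txt = centroid([l2_normalize(v) for v in txt_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))


if __name__ == "__main__":
    # Two toy clusters: images near one axis, texts near the other.
    images = [[1.0, 0.1], [0.9, 0.0]]
    texts = [[0.1, 1.0], [0.0, 0.8]]
    print(modality_gap(images, texts))
```

A value near zero means the two modalities occupy the same region of the embedding space, which corresponds to overlapping point clouds in a scatter plot.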