C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning

Ji Ma, Wei Suo, Peng Wang, Yanning Zhang

TL;DR

C$^3$L addresses two problems with Vision-Language Instruction Tuning (VLIT) data generated by Large Vision-Language Models (LVLMs): language priors cause the generated data to align poorly with the actual images, and the extra training phase used to improve generation introduces exposure bias. It introduces a Content Relevance Module that computes Image Instruction Correspondence scores $S(I^2C)$ to filter and reweight data, and a Contrastive Learning Module that uses positive and negative pseudo-labels to mitigate exposure bias and improve the quality of the generated data. Combining these modules, the pipeline produces a compact 5k VLIT dataset that yields competitive or superior results on the SEED, MMB, LLaVA$^W$, and POPE benchmarks at reduced computational cost. The approach improves cross-modal alignment and offers a practical data-generation paradigm for fine-tuning LVLMs when labeled VLIT data is limited.

Abstract

Vision-Language Instruction Tuning (VLIT) is a critical training phase for Large Vision-Language Models (LVLMs). With the improving capabilities of open-source LVLMs, researchers have increasingly turned to generating VLIT data with open-source LVLMs and have achieved significant progress. However, such data generation approaches are bottlenecked by the following challenges: 1) Since multi-modal models tend to be influenced by prior language knowledge, directly using LVLMs to generate VLIT data inevitably leads to low content relevance between the generated data and the images. 2) To improve the ability of the models to generate VLIT data, previous methods have incorporated an additional training phase to boost generative capacity. This process hurts the generalization of the models to unseen inputs (i.e., the "exposure bias" problem). In this paper, we propose a new Content Correlated VLIT data generation via Contrastive Learning (C3L). Specifically, we design a new content relevance module that enhances the content relevance between VLIT data and images by computing Image Instruction Correspondence Scores $S(I^2C)$. Moreover, a contrastive learning module is introduced to further boost the VLIT data generation capability of the LVLMs. A large number of automatic measures on four benchmarks show the effectiveness of our method.

Paper Structure (29 sections, 3 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: The illustration of the prior language knowledge problem when directly using current LVLMs to generate VLIT data. Existing models tend to generate data that exhibits low content relevance with the corresponding images (denoted in red). Our method effectively enhances the content relevance between VLIT data and images (denoted in green).
  • Figure 2: Overview of our Content Correlated VLIT data generation via Contrastive Learning (C$^3$L). Given the initial dataset and corresponding images, we first use the Content Relevance module to obtain the $I^2C$ scores based on whether or not the image is provided. Then, positive and negative pseudo-labels are selected based on the $I^2C$ scores. Finally, our Contrastive Learning module maximizes the similarity between the anchor and the positive pseudo-label while minimizing the similarity between the anchor and the negative pseudo-labels.
  • Figure 3: Testing alternative data selection proportions. We conduct experiments on LLaVA$^W$ to test the effects of different data selection proportions.
  • Figure 4: Generation results. We show the VLIT data generated w/o C$^3$L and w/ C$^3$L.
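
The pipeline described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exact definition of $S(I^2C)$ and the exact loss are not given in this overview, so the sketch assumes (a) a hypothetical score proxy equal to the difference between a model score with and without the image, (b) selection of the highest-scoring candidate as the positive pseudo-label and the lowest-scoring ones as negatives, and (c) an InfoNCE-style loss over cosine similarities with a temperature. The function names `i2c_score`, `select_pseudo_labels`, and `info_nce` are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def i2c_score(score_with_image: float, score_without_image: float) -> float:
    """Assumed proxy for an I2C score: how much a model score changes
    when the image is provided versus withheld."""
    return score_with_image - score_without_image


def select_pseudo_labels(
    scores: Sequence[float], num_negatives: int
) -> Tuple[int, List[int]]:
    """Assumed selection rule: the top-scoring candidate is the positive,
    the `num_negatives` lowest-scoring candidates are the negatives."""
    if not 1 <= num_negatives < len(scores):
        raise ValueError("need 1 <= num_negatives < number of candidates")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[0], order[-num_negatives:]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """InfoNCE-style loss (an assumption, not the paper's stated loss):
    small when the anchor is close to the positive and far from negatives."""
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_denominator - pos_logit
```

Given per-candidate scores from `i2c_score`, `select_pseudo_labels` returns the indices used to pick the positive and negative embeddings passed to `info_nce`. Minimizing that loss pulls the anchor toward the positive pseudo-label and pushes it away from the negatives, which is the behavior the caption describes.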