Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo

TL;DR

This paper introduces Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes training visually correlated tokens and demonstrates that CAL consistently improves different types of VLMs across different resolutions and model sizes on various benchmark datasets.

Abstract

Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-optimal cross-modal alignment by over-emphasizing text tokens that are weakly correlated with, or even contradict, the input images. In this paper, we advocate assigning a distinct contribution to each text token based on its visual correlation. Specifically, we show that, when contrasting image inputs, the difference in prediction logits on each text token provides strong guidance on its visual correlation. We therefore introduce Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes training on visually correlated tokens. Our experimental results demonstrate that CAL consistently improves different types of VLMs across different resolutions and model sizes on various benchmark datasets. Importantly, our method incurs minimal additional computational overhead, making it highly efficient compared to alternative data scaling strategies. Code is available at https://github.com/foundation-multimodal-models/CAL.
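The re-weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes the model has already produced per-token logits twice, once with and once without the image, and it takes the gap on each label token as that token's weight. The function names, the clamping range `lo`/`hi`, and the normalization by the sum of weights are all hypothetical choices made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_weights(logits_img, logits_noimg, labels, lo=0.0, hi=5.0):
    """Per-token weights from the label-logit gap with vs. without the image.

    The clamp range [lo, hi] is an illustrative assumption, not a value
    taken from the paper.
    """
    weights = []
    for with_img, without_img, y in zip(logits_img, logits_noimg, labels):
        delta = with_img[y] - without_img[y]  # how much the image helps token y
        weights.append(min(max(delta, lo), hi))
    return weights


def reweighted_nll(logits_img, labels, weights):
    """Weighted negative log-likelihood, normalized by the weight sum (assumed)."""
    total = sum(weights)
    if total == 0:
        # No token benefits from the image: fall back to uniform weighting.
        weights = [1.0] * len(labels)
        total = float(len(labels))
    loss = 0.0
    for row, y, w in zip(logits_img, labels, weights):
        loss += w * -log_softmax(row)[y]
    return loss / total


# Toy example: two tokens, vocabulary of size 3.
logits_img = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # image provided
logits_noimg = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # image withheld
labels = [0, 1]

weights = contrastive_weights(logits_img, logits_noimg, labels)
loss = reweighted_nll(logits_img, labels, weights)
```

In this toy input, only the first token's label logit changes when the image is present, so it receives all of the weight and the second token contributes nothing to the loss.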

Paper Structure (39 sections, 5 equations, 7 figures, 12 tables, 1 algorithm)

Figures (7)

  • Figure 1: Figure 1(a) shows one sample drawn from the ShareGPT4V dataset that contains text tokens which even contradict the given image. Figure 1(b) presents our human evaluation results on the proportion of noisy samples that contain contradictory tokens.
  • Figure 2: Overview of CAL. Figure 2(a) presents a sample drawn from the ShareGPT4V dataset. We calculate the logit difference with and without image inputs and plot the heat map on a subset of the text tokens. Figure 2(b) presents the training procedure of CAL, which re-weights the importance of label tokens based on the contrasting logits.
  • Figure 3: Accuracy difference when different noise ratios are applied. The performance of the baseline is marked with red lines, and that of CAL with green lines. The dashed line represents the asymptote.
  • Figure 4: $\Delta \mathbf{o}$ distribution for LLaVA models on 100 randomly sampled cases.
  • Figure 5: Comparison of attention maps with and without CAL on LLaVA-NeXT-13B. The left side of each sub-figure shows LLaVA-NeXT-13B without CAL, while the right side shows LLaVA-NeXT-13B with CAL.
  • ...and 2 more figures