
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu

TL;DR

This paper presents the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs), and proposes evaluating the pre-training quality from the inter-modal distribution distance perspective.

Abstract

We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs). Large-scale pre-training plays a critical role in building capable LVLMs, yet evaluating its quality without the costly supervised fine-tuning stage is under-explored. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), but we observe that these metrics are less indicative when aligning a well-trained LLM with a new modality. The lack of proper metrics greatly hinders research on the critical pre-training stage of LVLMs, including training data choice and efficient module design. In this paper, we propose evaluating pre-training quality from the perspective of inter-modal distribution distance and present MIR, the Modality Integration Rate, which is 1) Effective: it represents the pre-training quality and shows a positive relation with benchmark performance after supervised fine-tuning; 2) Robust to different training/evaluation data; and 3) Generalizable across training configurations and architecture choices. We conduct a series of pre-training experiments to explore the effectiveness of MIR and observe that MIR is indicative for training data selection, training strategy scheduling, and model architecture design, helping to obtain better pre-training results. We hope MIR can be a helpful metric for building capable LVLMs and inspire follow-up research on modality alignment in different areas. Our code is at: https://github.com/shikiw/Modality-Integration-Rate.
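
To make the phrase "inter-modal distribution distance" concrete, here is a minimal sketch of one way to measure how far apart vision-token and text-token features lie. The abstract does not spell out the distance used by MIR, so this sketch assumes a squared Fréchet distance between Gaussian fits with diagonal covariance, chosen purely for illustration. The function name and the toy inputs are hypothetical and are not taken from the authors' code.

```python
import math
from statistics import fmean, pvariance


def diagonal_frechet_distance(vision_feats, text_feats):
    """Squared Frechet distance between diagonal-Gaussian fits of two feature sets.

    Assumption: each feature dimension is treated as independent, so the
    distance reduces to a sum over dimensions of
    (mean gap)^2 + (standard-deviation gap)^2.
    Each argument is a list of equal-length feature vectors.
    """
    dim = len(vision_feats[0])
    total = 0.0
    for d in range(dim):
        v = [row[d] for row in vision_feats]
        t = [row[d] for row in text_feats]
        mean_gap = fmean(v) - fmean(t)
        std_gap = math.sqrt(pvariance(v)) - math.sqrt(pvariance(t))
        total += mean_gap ** 2 + std_gap ** 2
    return total


# Toy features standing in for token representations at one layer.
vision = [[0.0, 0.0], [2.0, 2.0]]
text = [[1.0, 1.0], [3.0, 3.0]]
gap = diagonal_frechet_distance(vision, text)
```

A smaller value means the two token populations occupy more similar regions of the feature space. A full-covariance variant would also need a matrix square root term, which this simplified version omits.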

Paper Structure

This paper contains 20 sections, 5 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Final loss, perplexity (PPL), and in-context evaluation are insufficient indicators of LVLM pre-training quality. We test the effectiveness of these three methods and our proposed MIR in a pre-training data scaling experiment, where we curate $\sim$1.8M GPT-style data from ALLaVA (chen2024allava) and ShareGPT4V-PT (chen2023sharegpt4v) and use different amounts of data to pre-train LLaVA-1.5 7B models (liu2024visual). Note that "Model Performance" means the post-SFT (Supervised Fine-tuning) performance on 7 multi-modal benchmarks after we apply the same SFT to each pre-trained model using LLaVA's 665K SFT data. In (a), we report the average loss over the last 50 pre-training steps as the final loss. In (b), the PPL is calculated on 1,000 randomly sampled image-caption pairs from ShareGPT4V. In (c), we apply 2-shot in-context evaluation, force the pre-trained models to respond with a choice on MME (fu2023mme), MMBench (liu2023mmbench), and SEED-Img (li2023seed), and report the average scores. These three metrics fail to measure the pre-training quality, while MIR fits the actual model performance well.
  • Figure 2: Current LVLMs show an obvious modality gap in the shallow layers. (Left) The t-SNE visualization depicts the significant gap between vision (warm colors) and text (cool colors) tokens in LLaVA-v1.5's embedding space, where we select six types of images (from DocVQA (mathew2021docvqa), ChartQA (masry2022chartqa), InfoVQA (mathew2022infographicvqa) and ShareGPT4V (chen2023sharegpt4v)) and three types of text data (from CNN News, Daily Mail (nallapati2016abstractive) and Code Search Net (husain2019codesearchnet)). (Right) The modality gap in different layers of the LVLM's language model, obtained while computing MIR. For most LVLMs, the first several layers still strive to narrow the modality gap until the middle layers achieve alignment.
  • Figure 3: MIR is robust to diverse kinds of data. We compute MIR on the pre-trained LLaVA-v1.5 7B model with various inputs to verify whether it is input-agnostic. In (a), we select four kinds of visual content (human, art, landscape, and OCR images) and two kinds of text content (news and mail text). In (b), we compare the MIR computed with the plain and vicuna v1 conversation templates, and on image-text relevant and irrelevant pairs. In (c), we select image-text pairs from the pre-training data as "Model Seen" data and unseen pairs as "Model Unseen" data, and depict the per-layer MIR.
  • Figure 4: MIR exhibits convergence properties similar to those of the training loss and closely corresponds with post-SFT model performance. We pre-train the LLaVA-v1.5 7B model with its vanilla setting and report the training loss, MIR, and post-SFT performance on 7 LVLM benchmarks.
  • Figure 5: MIR is robust toward overfitting. We conduct 2-epoch pre-training based on the settings of the ShareGPT4V 7B model, and report training loss, MIR, and post-SFT model performance (averaged over 7 LVLM benchmarks) at each step. The training loss shows a sharp drop at the beginning of the second epoch while the model performance does not. This shows that MIR is more consistent with the post-SFT model performance than the training loss is.
  • ...and 4 more figures
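
Two of the baseline metrics described in the Figure 1 caption reduce to short formulas, sketched below. The final loss follows the caption directly (mean over the last 50 pre-training steps). For perplexity, the caption only says it is computed on 1,000 sampled image-caption pairs, so the sketch assumes the standard definition, the exponential of the mean per-token negative log-likelihood. The function names and the toy numbers are hypothetical.

```python
import math


def final_loss(step_losses, window=50):
    """Mean training loss over the last `window` steps (50 in the Figure 1 caption)."""
    tail = step_losses[-window:]
    return sum(tail) / len(tail)


def perplexity(token_nlls):
    """Perplexity as exp(mean negative log-likelihood per token).

    Assumption: token-level averaging over all caption tokens; the caption
    does not state how the per-pair values are aggregated.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))


# Toy loss curve: 10 early steps at 3.0, then 50 steps at 1.0.
losses = [3.0] * 10 + [1.0] * 50
loss_value = final_loss(losses)

# Toy per-token negative log-likelihoods, each equal to ln(4).
ppl_value = perplexity([math.log(4.0)] * 3)
```

Both quantities depend only on how well the model predicts the pre-training text, which is one reason they can keep improving (for example at the start of a second epoch) without a matching gain in post-SFT performance.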