Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories

Nilay Naharas, Dang Nguyen, Nesihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni, Baharan Mirzasoleiman

TL;DR

This work tackles data redundancy in fine-tuning large vision-language models by linking example gradients to cross-modal attention. It introduces XMAS, which uses a proxy VLM to track cross-modal alignment trajectories via the top singular values of cross-modal attention, clusters examples by trajectory similarity, and samples a balanced, stable subset for training. Theoretical results bound gradient differences by attention-distance signals across checkpoints and establish convergence-type guarantees for subset-based fine-tuning. Empirically, XMAS achieves substantial data reductions (e.g., 50% on LLaVA-665k and 85% on Vision-Flan) while preserving or matching full-data performance on ten downstream benchmarks and speeds up training by about 1.2x, outperforming all baselines. These findings offer a practical, scalable approach to data-efficient LVLM instruction tuning with strong theoretical grounding.

Abstract

Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of the existing methods can outperform random selection at different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tuning have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singular values of their attention matrices obtained from fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments show that XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving the performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2x. This is 30% more data reduction than the best baseline achieves on LLaVA-665k. The project's website can be found at https://bigml-cs-ucla.github.io/XMAS-project-page/.
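
The pipeline described in the abstract (singular-value trajectories, clustering, balanced sampling) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `alignment_trajectory`, `kmeans`, `instability`, and `balanced_stable_subset` are hypothetical; the use of plain k-means, the round-robin sampling rule, and the instability measure (total change between consecutive checkpoints) are all assumptions made for illustration; and random matrices stand in for the cross-modal attention matrices that a proxy LVLM would produce at each checkpoint.

```python
import numpy as np

def alignment_trajectory(attn_per_checkpoint, top_k=1):
    """Concatenate the top-k singular values of each checkpoint's attention matrix."""
    feats = []
    for attn in attn_per_checkpoint:
        s = np.linalg.svd(attn, compute_uv=False)  # singular values, descending
        feats.append(s[:top_k])
    return np.concatenate(feats)

def kmeans(x, n_clusters, n_iters=20, seed=0):
    """Tiny k-means (assumed clustering choice; the source only says 'clusters')."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean(axis=0)
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def instability(traj, top_k=1):
    """Assumed stand-in score: total absolute change between checkpoints."""
    steps = traj.reshape(-1, top_k)  # one row per checkpoint
    return np.abs(np.diff(steps, axis=0)).sum()

def balanced_stable_subset(trajs, labels, budget, top_k=1):
    """Round-robin over clusters, taking the most stable examples first."""
    scores = np.array([instability(t, top_k) for t in trajs])
    queues = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        queues.append(list(idx[np.argsort(scores[idx])]))
    chosen = []
    while len(chosen) < budget and any(queues):
        for q in queues:
            if q and len(chosen) < budget:
                chosen.append(int(q.pop(0)))
    return sorted(chosen)

# Synthetic stand-in: 200 examples, 3 checkpoints, 8 text x 16 image tokens.
rng = np.random.default_rng(0)
attention = rng.random((200, 3, 8, 16))
trajectories = np.stack([alignment_trajectory(a) for a in attention])
cluster_ids = kmeans(trajectories, n_clusters=5)
subset = balanced_stable_subset(trajectories, cluster_ids, budget=20)
```

Taking examples from each cluster in turn keeps the subset balanced across clusters, and sorting each cluster by the assumed instability score means the steadiest trajectories are chosen first.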

Paper Structure

This paper contains 20 sections, 7 theorems, 48 equations, 10 figures, 5 tables, 1 algorithm.

Key Result

Theorem 4.1

Consider a single-layer transformer with a single attention head with RMS layer normalization, that is trained using the Frobenius norm squared loss. Let $D$ be the hidden dimensionality of the model, $N$ be the number of input tokens, and $c\geq\|\phi^t\|$ be the upper-bound on the norm of model pa

Figures (10)

  • Figure 1: The average relative performance (ARP) of different subsets of (left) LLaVA 665k and (right) Vision-Flan when fine-tuning LLaVA-1.5-7B. Methods that outperform random selection are shown as opaque. XMAS is the only method that surpasses random selection across different budgets on both datasets. MP and HL fine-tune the target model on the full data to find subsets and thus do not yield any speedup.
  • Figure 2: (left) Data reduction needed to reach 100% average relative performance (ARP) of LLaVA-1.5-7B on LLaVA 665k. XMAS obtains 30% more data reduction than the best baselines. (middle) ARP ranking of different 10% subsets of LLaVA 665k when fine-tuning LLaVA-1.5-13B. (right) ARP and ratio of total time (selection + training) relative to training the target model on the full dataset for XMAS at different budgets on LLaVA 665k. XMAS reaches 100% ARP at a time ratio of 0.84 (1.2$\times$ speedup).
  • Figure 3: Average performance relative to full data for 10% subsets of LLaVA 665k found by COINCIDE and XMAS when varying (left) proxy model, (middle) number of checkpoints and (right) number of clusters.
  • Figure 4: Per-cluster Euclidean distance between the cross-modal alignment trajectories of the proxy and target models on LLaVA 665k. The distance between the trajectories of the two models is very small.
  • Figure 5: XMAS employs a small proxy VLM to find the alignment trajectory of each example in the fine-tuning data. Examples with similar alignment trajectories have similar gradients during instruction tuning. XMAS then clusters the alignment trajectories and samples a balanced subset of examples with more stable trajectories from the clusters.
  • ...and 5 more figures
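
The time ratio and the speedup quoted in the Figure 2 caption are two views of the same number. As a short worked check, writing $T_{\text{full}}$ for the full-data training time and $T_{\text{XMAS}}$ for the XMAS selection-plus-training time (both symbols are labels introduced here, not notation from the paper):

```latex
\[
\frac{T_{\text{XMAS}}}{T_{\text{full}}} = 0.84
\quad\Longrightarrow\quad
\text{speedup} = \frac{T_{\text{full}}}{T_{\text{XMAS}}} = \frac{1}{0.84} \approx 1.19 \approx 1.2\times
\]
```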

Theorems & Definitions (15)

  • Definition 4.1: Cross-modal alignment score $\sigma$
  • Theorem 4.1
  • Theorem 4.2
  • Definition 4.2: Alignment trajectory.
  • Definition 4.3: Instability score
  • Corollary 4.3: Informal: Convergence of XMAS
  • Lemma 1
  • Lemma 2
  • proof
  • Lemma 3: zhang2023notes, Theorem 1.8
  • ...and 5 more