MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training
Wanyun Xie, Francesco Tonin, Volkan Cevher
TL;DR
MaD-Mix addresses efficient, principled data mixing for vision-language model (VLM) training. It formulates mixture selection as modality-aware domain alignment in a shared latent space and solves the problem through its Fenchel dual, which yields closed-form domain alignment scores. Domains with missing modalities, such as language-only data, are handled by decoupling the absent modality from the objective. The resulting domain weights drive sampling without costly tuning; the final weights are obtained through spectral soft-thresholding of the multi-modal kernel. On 0.5B and 7B VLMs, MaD-Mix matches or surpasses expert-tuned mixtures, using 22% fewer training steps in image-text instruction tuning with a mixture computation overhead under 1 GPU-hour, and in tri-modal video-image-text settings it improves average accuracy over uniform weights. The domain weights transfer across model sizes and architectures, offering a scalable, plug-and-play approach to data mixture design for modern VLM pipelines.
Abstract
Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practices rely on costly manual tuning. We propose MaD-Mix, a principled and computationally efficient framework that derives multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual through inter-modal coupling variables. MaD-Mix systematically handles domains with missing modalities, allowing for the integration of language-only domains. Empirical evaluations across 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks. MaD-Mix matches human-tuned data mixtures using 22% fewer training steps in image-text instruction tuning. In complex tri-modal video-image-text scenarios, where manual tuning becomes impractical, MaD-Mix boosts average accuracy over uniform weights, with negligible mixture computation overhead (< 1 GPU-hour), enabling scalable mixture design for modern VLM pipelines.
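To make concrete how a set of domain weights drives sampling during training, the following is a minimal sketch and not the MaD-Mix procedure itself. It assumes that some non-negative per-domain score is already available; the domain names, the score values, and the helpers `normalize_weights` and `sample_domains` are hypothetical and chosen only for illustration. Plain sum-normalization stands in for whatever mapping from alignment scores to weights the method actually uses.

```python
import random
from collections import Counter


def normalize_weights(scores):
    """Turn non-negative per-domain scores into sampling probabilities.

    Assumption: simple sum-normalization; the paper's mapping from
    alignment scores to domain weights may differ.
    """
    if any(s < 0 for s in scores.values()):
        raise ValueError("scores must be non-negative")
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {domain: s / total for domain, s in scores.items()}


def sample_domains(weights, num_batches, seed=0):
    """Draw the source domain of each training batch according to the weights."""
    rng = random.Random(seed)
    domains = list(weights)
    probs = [weights[d] for d in domains]
    return rng.choices(domains, weights=probs, k=num_batches)


# Hypothetical scores for four domains, one of them language-only.
scores = {"captioning": 2.0, "vqa": 1.0, "ocr": 0.5, "text_only": 0.5}
weights = normalize_weights(scores)
schedule = sample_domains(weights, num_batches=10_000)
counts = Counter(schedule)
```

In this sketch the mixture is computed once before training and only changes how often each domain is drawn, which is why such weights can in principle be reused across model sizes.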
