MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training

Wanyun Xie; Francesco Tonin; Volkan Cevher

TL;DR

MaD-Mix addresses efficient, principled data mixing for vision-language model (VLM) training. It formulates the choice of data mixture as modality-aware domain alignment maximization in a shared latent space and solves the problem through its Fenchel dual, which yields closed-form multi-modal alignment scores. Domains with missing modalities, such as language-only data, are handled by decoupling the absent modality from the objective. The domain weights that drive sampling are obtained by softmax-normalizing the alignment scores, so no costly manual tuning is needed. On 0.5B and 7B VLMs, MaD-Mix matches human-tuned mixtures with 22% fewer training steps in image-text instruction tuning, and in tri-modal video-image-text settings it improves average accuracy over uniform weights, with mixture computation taking less than 1 GPU-hour. The domain weights transfer across model sizes and architectures, offering a scalable, plug-and-play approach to data mixture design for modern VLM pipelines.

Abstract

Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practices rely on costly manual tuning. We propose MaD-Mix, a principled and computationally efficient framework that derives multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual through inter-modal coupling variables. MaD-Mix systematically handles domains with missing modalities, allowing for the integration of language-only domains. Empirical evaluations across 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks. MaD-Mix matches human-tuned data mixtures using 22% fewer training steps in image-text instruction tuning. In complex tri-modal video-image-text scenarios, where manual tuning becomes impractical, MaD-Mix boosts average accuracy over uniform weights, with negligible mixture computation overhead (< 1 GPU-hour), enabling scalable mixture design for modern VLM pipelines.

Paper Structure (47 sections, 2 theorems, 28 equations, 4 figures, 15 tables, 1 algorithm)

Introduction
Related Works
Data composition in VLMs.
Data mixing.
Multi-modal Mixing with Modality-aware Domain Alignment
Setup and objective.
Multi-modal domain alignment scores
Interpretation.
Multi-modal scores with missing modalities
Computational complexity & practical implementation.
Experiments
Training setup.
Baselines.
Evaluation benchmarks.
MaD-Mix improves both 0.5B and 7B VLMs
...and 32 more sections

Key Result

Proposition 3.2

Define the multi-modal kernel matrix as $K_\mathrm{MM} \in \mathbb{R}^{k \times k}$ with entries $(K_\mathrm{MM})_{ij} = \sum_{v=1}^V K_{ij}^{[v]}$, where $K_{ij}^{[v]} = (x_i^{[v]})^\top x_j^{[v]}$. The optimal latent variables for the multi-modal objective are given by: where $\delta = [\delta_1, ..., \delta_k]^\top$ with entries $\delta_{i} = \sum_{v=1}^V \delta_i^{[
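The kernel definition in Proposition 3.2 can be illustrated with a minimal NumPy sketch. It assumes every domain has an embedding in every modality (the missing-modality case is not covered here), and the function name `multimodal_kernel` and the toy embeddings are hypothetical, chosen only for illustration.

```python
import numpy as np


def multimodal_kernel(embeddings):
    """Sum of per-modality Gram matrices.

    embeddings: list of V arrays, one per modality, each of shape (k, d_v),
    where row i is the domain embedding x_i^{[v]}.
    Returns the (k, k) matrix with entries sum_v (x_i^{[v]})^T x_j^{[v]}.
    """
    k = embeddings[0].shape[0]
    K_mm = np.zeros((k, k))
    for X in embeddings:
        if X.shape[0] != k:
            raise ValueError("every modality must provide k domain embeddings")
        K_mm += X @ X.T  # Gram matrix K^{[v]} of modality v
    return K_mm


# Toy example (hypothetical values): k = 2 domains, V = 2 modalities.
X_text = np.array([[1.0, 0.0], [0.0, 1.0]])
X_image = np.array([[1.0, 1.0], [1.0, -1.0]])
K_mm = multimodal_kernel([X_text, X_image])
# The text Gram matrix is the identity and the image Gram matrix is 2 * identity,
# so K_mm equals 3 * identity here.
```

Summing Gram matrices is equivalent to taking inner products of the concatenated per-modality embeddings, which is why the modalities may have different dimensions $d_v$.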

Figures (4)

Figure 1: Pipeline of multi-modal data mixing for VLM training. (A) Modality-specific embeddings $x_i^{[v]}$ are extracted from the mid-stage trained model for each domain. Some domains may lack certain modalities (e.g., the language domain has no image data). (B) The $k$ domains are then mapped to a shared multi-modal space by the coupling latent variables $\alpha$ of the multi-modal alignment objective (\ref{eq:mm_objective}). (C) The multi-modal kernel matrix $K_\mathrm{MM}$ is computed as the pairwise inner products between domain embeddings across modalities via (\ref{eq:h:multi-modal}). Then (\ref{eq:final_score}) is applied to $K_\mathrm{MM}$ and $\alpha$ to obtain the scores $S_i$, $i=1,\ldots,k$, indicating the multi-modal alignment of each domain. A non-uniform resampling distribution $p$ is obtained by softmax-normalizing the scores. (D) Finally, image-text instruction tuning of the target VLM is carried out by sampling according to the obtained data mixture $p$.
Figure 2: Comparison of different data mixture strategies in the image-text instruction tuning. (Left) Domain weights for uniform, human, and MaD-Mix. (Right) Zero-shot average downstream accuracy of 0.5B models, where MaD-Mix achieves consistent improvement.
Figure 3: Comparison of different data mixtures in the video-image-text instruction tuning. (Left) Domain weights for uniform and MaD-Mix. (Right) Zero-shot average downstream accuracy of 0.5B models, where MaD-Mix outperforms Uniform during the whole training process.
Figure 4: Embedding kernel similarity matrix for different modalities.
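The last two steps described in the Figure 1 caption (softmax-normalizing the scores $S_i$ into a distribution $p$, then sampling domains according to $p$) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the score values are placeholders, and the `temperature` parameter and both function names are assumptions that do not appear in the source.

```python
import numpy as np


def softmax_weights(scores, temperature=1.0):
    """Turn per-domain scores S_i into a sampling distribution p.

    `temperature` is a hypothetical knob (not from the source); 1.0 gives a plain softmax.
    """
    s = np.asarray(scores, dtype=float) / temperature
    s = s - s.max()  # shift by the maximum for numerical stability
    w = np.exp(s)
    return w / w.sum()


def sample_domains(p, n, seed=0):
    """Draw n domain indices i.i.d. from the mixture p."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(p), size=n, p=p)


# Placeholder scores for k = 3 domains (illustrative values only).
scores = [1.0, 2.0, 3.0]
p = softmax_weights(scores)
batch_domains = sample_domains(p, n=1000)
```

Domains with higher scores receive larger sampling probability, and equal scores reduce to the uniform mixture used as a baseline in Figures 2 and 3.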

Theorems & Definitions (6)

Remark 3.1: Beamforming
Proposition 3.2: Multi-modal scores
Remark 3.3: Spectral characterization
Remark A.1: Efficient computation of the score
Lemma A.2
proof
