Distributionally Robust Multimodal Machine Learning
Peilin Yang, Yu Ma
TL;DR
The paper tackles distributional shifts in multimodal learning and proposes a modality-aware distributionally robust optimization framework that accounts for modality-specific uncertainties and cross-modal correlations via a shared copula and χ^2-ambiguity sets. It derives a closed-form generalization bound r(f∘g, P_X) = E[ℓ] + √B √Var(ℓ) with B = ∑ ρ_k + 2 ∑_{i<j} |γ_{ij}| √(ρ_i ρ_j), along with encoder-aware upper bounds and a minimax lower bound. It contrasts with early fusion by showing decomposed risk bounds and computational advantages, and validates robustness gains on simulations and real-world datasets (journalism and healthcare). These results offer a principled framework for deploying multimodal models in high-stakes settings where uncertainty is unavoidable.
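As a rough numerical illustration of the closed-form expression above, the following sketch computes B and the resulting robust risk from a sample of losses. It assumes that ρ_k are the per-modality χ^2 ambiguity radii and that γ_{ij} are pairwise cross-modal dependence coefficients supplied as a symmetric matrix. It also assumes that E[ℓ] and Var(ℓ) can be replaced by the empirical mean and the population variance of the observed losses. The function names `robustness_factor` and `robust_risk` are hypothetical and do not come from the paper.

```python
import numpy as np


def robustness_factor(rho, gamma):
    """B = sum_k rho_k + 2 * sum_{i<j} |gamma_ij| * sqrt(rho_i * rho_j)."""
    rho = np.asarray(rho, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    root = np.sqrt(rho)
    # Pairwise terms |gamma_ij| * sqrt(rho_i * rho_j) for every (i, j).
    pair = np.abs(gamma) * np.outer(root, root)
    # Keep only the strictly upper triangle, i.e. the pairs with i < j.
    cross = np.triu(pair, k=1).sum()
    return rho.sum() + 2.0 * cross


def robust_risk(losses, rho, gamma):
    """Plug-in estimate of E[loss] + sqrt(B) * sqrt(Var(loss))."""
    losses = np.asarray(losses, dtype=float)
    b = robustness_factor(rho, gamma)
    # Assumption: population variance (ddof=0) stands in for Var(loss).
    return losses.mean() + np.sqrt(b) * np.sqrt(losses.var())


if __name__ == "__main__":
    rho = [0.04, 0.09]                      # illustrative radii, two modalities
    gamma = [[1.0, -0.5], [-0.5, 1.0]]      # illustrative dependence matrix
    losses = [1.0, 2.0, 3.0, 4.0]           # illustrative per-sample losses
    print("B =", robustness_factor(rho, gamma))
    print("robust risk =", robust_risk(losses, rho, gamma))
```

When all γ_{ij} are zero, B reduces to the sum of the individual radii, so the cross term can be read as the extra penalty attributable to dependence between modalities.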
Abstract
We consider the problem of distributionally robust multimodal machine learning. Existing approaches often rely on merging modalities at the feature level (early fusion) or on heuristic uncertainty modeling, which downplay modality-aware effects and provide limited insight. We propose a novel distributionally robust optimization (DRO) framework that yields both theoretical and practical insights into multimodal machine learning. We first justify this setup and show the significance of the problem through complexity analysis. We then establish generalization upper bounds and minimax lower bounds, which provide performance guarantees. These results are further extended to settings that account for encoder-specific error propagation. Empirically, we demonstrate that our approach improves robustness both in simulation settings and on real-world datasets. Together, these findings provide a principled foundation for employing multimodal machine learning models in high-stakes applications where uncertainty is unavoidable.
