The Benefits of Balance: From Information Projections to Variance Reduction
Lang Liu, Ronak Mehta, Soumik Pal, Zaid Harchaoui
TL;DR
This paper shows that data balancing across modalities in self-supervised and foundation-model training acts as a variance-reduction mechanism. It formalizes a two-phase process that first maps observed data to derived multimodal pairs and then applies Sinkhorn-type iterative balancing toward target marginals, producing a model-dependent distribution $P_{n,\theta}$. The main theoretical contribution is a non-asymptotic MSE bound for the balanced estimator, where $n$ is the sample size and $k$ the number of balancing iterations. The bound decomposes into an $O(n^{-1})$ variance term governed by the spectral properties of the conditional-mean operators $\mu_X$ and $\mu_Y$ and an $O(k^6 n^{-3/2})$ higher-order term, with a geometric decay tied to the singular values $s_j$ under a positive spectral gap. The results connect balancing to SSL objectives such as CLIP and SwAV, which gives variance-aware interpretations of practical techniques and suggests better designs for the balancing steps used in training. Empirically, the authors demonstrate variance-reduction-driven gains in zero-shot classification and metadata-curation tasks, validating the theory and suggesting that careful marginal balancing can enhance multimodal learning systems at scale.
Abstract
Data balancing across multiple modalities and sources appears in various forms in foundation models in machine learning and AI, e.g. in CLIP and DINO. We show that data balancing across modalities and sources actually offers an unsuspected benefit: variance reduction. We present a non-asymptotic statistical bound that quantifies this variance reduction effect and relates it to the eigenvalue decay of Markov operators. Furthermore, we describe how various forms of data balancing in contrastive multimodal learning and self-supervised clustering can be better understood, and even improved upon, owing to our variance reduction viewpoint.
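To make the notion of balancing concrete, the following is a minimal sketch of Sinkhorn-type iterative balancing (also known as iterative proportional fitting) on a small joint probability table. It is an illustration under stated assumptions, not the authors' implementation: the function name `balance`, its parameters, and the 2x2 example table are all hypothetical, and the sketch assumes a strictly positive table with known target marginals.

```python
def balance(joint, target_rows, target_cols, num_iters=50):
    """Alternately rescale rows and columns of a joint probability table
    so that its marginals approach the given targets.

    Hypothetical helper for illustration; assumes every row and column
    of `joint` has positive mass.
    """
    p = [row[:] for row in joint]  # work on a copy
    n_rows, n_cols = len(p), len(p[0])
    for _ in range(num_iters):
        # Row step: rescale each row so the first marginal matches target_rows.
        for i in range(n_rows):
            row_sum = sum(p[i])
            p[i] = [v * target_rows[i] / row_sum for v in p[i]]
        # Column step: rescale each column so the second marginal matches target_cols.
        col_sums = [sum(p[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for i in range(n_rows):
            p[i] = [p[i][j] * target_cols[j] / col_sums[j] for j in range(n_cols)]
    return p


# Hypothetical empirical joint distribution over two binary variables (X, Y).
empirical = [[0.3, 0.1],
             [0.1, 0.5]]
uniform = [0.5, 0.5]  # assumed target marginals for both X and Y

balanced = balance(empirical, uniform, uniform)
row_marginals = [sum(row) for row in balanced]
col_marginals = [sum(balanced[i][j] for i in range(2)) for j in range(2)]
```

Each step only multiplies rows or columns by positive constants, so the dependence structure of the table (for a 2x2 table, its odds ratio) is preserved while the marginals move toward the targets. In practice the number of iterations is a design choice; the bound summarized above includes a higher-order term that grows with $k$.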
