The Benefits of Balance: From Information Projections to Variance Reduction

Lang Liu, Ronak Mehta, Soumik Pal, Zaid Harchaoui

TL;DR

This paper shows that data balancing across modalities in self-supervised and foundation-model training acts as a variance-reduction mechanism. It formalizes a two-phase process that maps observed data to derived multimodal pairs and then applies Sinkhorn-type iterative balancing toward target marginals, producing a model-dependent distribution $P_{n,\theta}$. The main theoretical contribution is a non-asymptotic MSE bound for the balanced estimator. The bound decomposes into an $O(n^{-1})$ variance term governed by the spectral properties of the conditional-mean operators $\mu_X$ and $\mu_Y$ and an $O(k^6 n^{-3/2})$ higher-order term, with geometric decay tied to the singular values $s_j$ under a positive spectral gap. The results connect balancing to SSL objectives such as CLIP and SwAV, giving variance-aware interpretations of practical techniques and suggesting improved designs for balancing steps in training. Empirically, the authors report variance-reduction-driven gains in zero-shot classification and metadata-curation tasks, which support the theory and suggest that careful marginal balancing can improve multimodal learning systems at scale.
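The iterative balancing step mentioned above can be illustrated on a small discrete joint distribution. The following is a minimal sketch, not the authors' implementation: it assumes a finite joint table with strictly positive marginals, and the function name `balance`, its arguments, and the convention that odd iterations match the $\mathcal{X}$-marginal while even iterations match the $\mathcal{Y}$-marginal are illustrative assumptions.

```python
import numpy as np

def balance(joint, target_x, target_y, num_iters=8):
    """Alternately rescale rows and columns of a joint table (raking sketch).

    Hypothetical helper: assumes every row and column of `joint`
    has positive mass, so the divisions below are well defined.
    """
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()  # normalize to a probability table
    target_x = np.asarray(target_x, dtype=float)
    target_y = np.asarray(target_y, dtype=float)
    for k in range(1, num_iters + 1):
        if k % 2 == 1:
            # Odd step: rescale rows so the X-marginal equals target_x.
            row = p.sum(axis=1, keepdims=True)
            p = p * (target_x[:, None] / row)
        else:
            # Even step: rescale columns so the Y-marginal equals target_y.
            col = p.sum(axis=0, keepdims=True)
            p = p * (target_y[None, :] / col)
    return p

# Toy example: a 2x2 empirical joint and uniform target marginals.
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
target_x = np.array([0.5, 0.5])
target_y = np.array([0.5, 0.5])
balanced = balance(joint, target_x, target_y, num_iters=8)
print(balanced.sum(axis=1), balanced.sum(axis=0))
```

With an even number of iterations, the last step matches the $\mathcal{Y}$-marginal exactly, while the $\mathcal{X}$-marginal is matched only approximately; the remaining gap shrinks as the number of iterations grows.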

Abstract

Data balancing across multiple modalities and sources appears in various forms in foundation models in machine learning and AI, e.g. in CLIP and DINO. We show that data balancing across modalities and sources actually offers an unsuspected benefit: variance reduction. We present a non-asymptotic statistical bound that quantifies this variance reduction effect and relates it to the eigenvalue decay of Markov operators. Furthermore, we describe how various forms of data balancing in contrastive multimodal learning and self-supervised clustering can be better understood, and even improved upon, owing to our variance reduction viewpoint.

Paper Structure

This paper contains 59 sections, 32 theorems, 241 equations, 8 figures, 1 table.

Key Result

Theorem 1

For a sequence of data balancing estimators $(\varphi_n^{(k)})_{k \geq 1}$ as defined in \ref{eq:est_framework}, there exist an absolute constant $C > 0$ and a distribution-dependent constant $s \in [0, 1)$ such that the following holds for $\sigma_{\text{gap}}^2 = \sigma_0^2 - \sigma_k^2$: For $n \geq C[\l

Figures (8)

  • Figure 1: Data Balancing Examples: Each panel shows a possible distribution $Q$ for a different choice of ($\mathcal{X}, \mathcal{Y}$). The orange histograms are the target marginal $P_Y$. Left: $Q(x, y)$ is the affinity of an image $x$ for cluster $y$. Center: $Q(x, y)$ is the similarity of an image $x$ to a text caption $y$. Right: $Q(x, y)$ is the proportion of substring matches between a text caption $x$ and a keyword $y$.
  • Figure 2: Data Balancing. Nonlinear and linear operators associated with each iteration of \ref{eq:raking}. Left: Visualization of the exact iterations of \ref{eq:raking} in the space of probability measures. The blue set contains joint distributions with $\mathcal{X}$-marginal equal to $P_X$, whereas the orange set contains joint distributions with $\mathcal{Y}$-marginal equal to $P_Y$. Right: Visualization of $\mathbf{L}^2(P)$, the operators defining \ref{eq:variance_k}, and the singular values given in \ref{eq:svd1}.
  • Figure 3: Zero-Shot Classification Performance across Embeddings, Batch Sizes, and Objectives. The three vertical panels correspond to different choices of the text encoder $f_{\theta_T}$, increasing in quality from left to right: pre-trained GPT-2, BERT, and CLIP embeddings, respectively. Within each vertical panel, results are shown for batch sizes $m = 128$ and $m = 512$. Rows correspond to the evaluation datasets CIFAR-10, CIFAR-100, and STL-10. The $y$-axis of each plot indicates average per-class recall, whereas the $x$-axis indicates training iterations at the given batch size.
  • Figure 4: Balancing and Metadata Curation. Depiction of balancing and metadata curation (Example 3 in \ref{sec:ssl}) on the ImageNet-Captions dataset, in which $\mathcal{X}$ represents image-caption pairs and $\mathcal{Y}$ represents keywords. Left: Observed marginal $P_{n, Y}$ (orange) and target marginal $P_Y$ (blue), sorted in order of increasing probability. Right: Zero-shot evaluation of an embedding model trained with the standard CLIP loss on the original versus the balanced training set.
  • Figure 5: Baseline Comparisons across Dependence and Misspecification Levels. Each line corresponds to a combination of an estimation method (the empirical probability measure $P_n$, the estimator $P^{\mathrm{IPWI}}_n$ from \ref{eq:ipwi}, or the balancing estimator $P_n^{(k)}$ for $k = 8$) and a noise level on the provided marginals (see \ref{eq:misspecification}). The $y$-axis shows the mean squared error of estimating a linear functional. The $x$-axis represents the dependence level $s = s_2$ (i.e., the leading singular value other than $s_1 = 1$).
  • ...and 3 more figures
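
The dependence level $s = s_2$ on the $x$-axis of Figure 5 can be made concrete for a finite joint distribution. The sketch below is an illustration under assumptions rather than the paper's code: it uses the standard fact that, for a discrete joint table $P$ with positive marginals, the singular values of the conditional-mean operator coincide with those of the matrix with entries $P(x, y) / \sqrt{P_X(x) P_Y(y)}$, whose leading singular value is $s_1 = 1$. The function name `singular_values` is hypothetical.

```python
import numpy as np

def singular_values(joint):
    """Singular values of the normalized joint table (hypothetical helper).

    Assumes a finite joint distribution whose marginals are strictly positive.
    """
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()
    px = p.sum(axis=1)  # X-marginal
    py = p.sum(axis=0)  # Y-marginal
    # Entry (x, y) is P(x, y) / sqrt(P_X(x) * P_Y(y)).
    m = p / np.sqrt(np.outer(px, py))
    # Returned in descending order; the first entry equals 1.
    return np.linalg.svd(m, compute_uv=False)

# Independent variables: all singular values other than s_1 vanish.
s_indep = singular_values(np.outer([0.3, 0.7], [0.4, 0.6]))

# Dependent variables: s_2 is strictly between 0 and 1.
s_dep = singular_values(np.array([[0.3, 0.1],
                                  [0.2, 0.4]]))
print(s_indep, s_dep)
```

Under this convention, $s_2 = 0$ corresponds to independence of the two components, and values of $s_2$ closer to 1 indicate stronger dependence.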

Theorems & Definitions (58)

  • Theorem 1
  • Corollary 2
  • Proposition 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Proposition 5
  • proof
  • ...and 48 more