Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Xiaomin Yu, Yi Xin, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao, Xiaoxing Hu, Xinlei Yu, Ziyue Qiao, Hao Tang, Xue Yang, Xiaobin Hu, Chengwei Qin, Hui Xiong, Yu Qiao, Shuicheng Yan

TL;DR

The paper proposes the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals, and introduces ReAlign, a training-free modality alignment strategy.

Abstract

Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign maps text representations into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.

Paper Structure

This paper contains 74 sections, 10 theorems, 62 equations, 13 figures, and 4 tables.

Key Result

Lemma B.1

For each $i$, the gradient of the InfoNCE loss with respect to the anchor embedding is a linear combination of the embeddings in the contrastive set. Symmetrically, the same holds for each $j$.
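
As a worked illustration of this statement, assume a one-directional InfoNCE loss over a batch of $N$ pairs, with anchors $x_i$, candidates $y_k$, temperature $\tau$, and the embeddings treated as free variables. These symbols are illustrative assumptions and may not match the paper's notation. Differentiating the loss gives a gradient that is a weighted sum of the candidate embeddings, and the reverse direction follows by swapping roles:

```latex
\begin{aligned}
\mathcal{L}_i &= -\log \frac{\exp(x_i^\top y_i / \tau)}{\sum_{k=1}^{N} \exp(x_i^\top y_k / \tau)},
\qquad
p_{ik} = \frac{\exp(x_i^\top y_k / \tau)}{\sum_{l=1}^{N} \exp(x_i^\top y_l / \tau)}, \\
\nabla_{x_i} \mathcal{L}_i
  &= \frac{1}{\tau} \Big( \sum_{k=1}^{N} p_{ik}\, y_k - y_i \Big)
   = \frac{1}{\tau} \sum_{k=1}^{N} \big( p_{ik} - \delta_{ik} \big)\, y_k, \\
\mathcal{L}'_j &= -\log \frac{\exp(x_j^\top y_j / \tau)}{\sum_{k=1}^{N} \exp(x_k^\top y_j / \tau)},
\qquad
q_{kj} = \frac{\exp(x_k^\top y_j / \tau)}{\sum_{l=1}^{N} \exp(x_l^\top y_j / \tau)}, \\
\nabla_{y_j} \mathcal{L}'_j
  &= \frac{1}{\tau} \sum_{k=1}^{N} \big( q_{kj} - \delta_{kj} \big)\, x_k .
\end{aligned}
```

Here $\delta_{ik}$ is the Kronecker delta, so the positive pair contributes with weight $p_{ii} - 1 \le 0$ and each negative with weight $p_{ik} \ge 0$.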

Figures (13)

  • Figure 1: Geometric decomposition of the modality gap. We characterize the intrinsic shape of the modality gap within a frozen reference frame. Unlike prior isotropic assumptions, we reveal that the gap consists of a systematic Stable Bias and Anisotropic Residuals. This precise modeling serves as the theoretical foundation for our statistical alignment strategy.
  • Figure 2: Geometric Statistics of the Modality Gap. (a) Geometric Gradient Constraint. The reference leakage ratio (blue) closely tracks the geometric baseline $\sin \theta(U_t, U)$ (red), confirming that gradients are confined within the evolving task subspace $U_t$. (b) Passive Bias Evolution. The orthogonal bias component $\gamma(t)$ exhibits high cosine stability (blue) with only slow cumulative drift (red), indicating a passive evolution driven by subspace rotation rather than direct optimization. (c) Semantic Signal Locking (U-side). In the semantic subspace $U$, the condition number $\kappa(\Sigma_U)$ (blue) remains extremely high ($>10^3$), showing strong anisotropy. The correlation $\rho_{\text{align}}$ (red) rapidly converges to $\approx 1$, confirming that residual variance is locked to the gradient covariance structure. (d) Orthogonal Noise Decoupling (V-side). In the orthogonal subspace $V$, the residual noise maintains a stretched shape ($\kappa > 10^1$, blue). Crucially, the bias vector $\gamma$ maintains an angle of $\approx 90^\circ$ (red) relative to the principal noise direction, proving that the static bias and dynamic noise are geometrically decoupled and orthogonal.
  • Figure 3: The ReAlign Pipeline. (a) Original State. The Source modality ($y$) and Target modality ($x$) exhibit discrepancies in both mean centroids and global trace. (b) Step 1: Anchor Alignment. The source is centered and shifted to the target anchor $\mu_x$ to eliminate first-order bias. (c) Step 2: Trace Alignment. Embeddings are scaled to match the target trace $\mathcal{T}_x$ via a linear affine transformation. Note that the subsequent spherical projection induces a non-linear centroid drift $\mu'$. (d) Step 3: Centroid Alignment. An explicit correction rectifies this drift ($e''_y = e'_y - \mu' + \mu_x$), realigning the mass center to the stable reference. (e) Final Output. Final re-normalization yields $\hat{e}_y$, ensuring precise distribution alignment on the unit manifold.
  • Figure 4: We measure the modality gap between aligned centroids on Bunny and DenseFusion. While the baseline C$^3$ stagnates at a geometric bottleneck ($\approx 0.0023$) due to isotropic assumptions, ReAlign reduces the gap to the $10^{-4}$ scale by effectively modeling anisotropic covariance.
  • Figure 5: Geometric Fidelity Analysis via Spectral and Angular Properties. (a) Semantic Hierarchy: The eigenspectrum analysis reveals that C$^3$ (red line) exhibits a flattened slope with an elevated tail ($\alpha \approx 1.06$), indicating that unstructured noise injection dilutes fine-grained semantic structure. In contrast, ReAlign (blue line) maintains a power-law decay ($\alpha \approx 1.33$) that matches the intrinsic geometry of the source text. (b) Angular Topology Matching: KDE plots of cosine similarities demonstrate that C$^3$ causes a severe distributional shift (JS Divergence = 0.1924), destroying angular relationships. ReAlign achieves a near-perfect overlap with the target prior (JS Divergence = 0.0066), validating its ability to restore centroid alignment while preserving the topological structure.
  • ...and 8 more figures
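
The three ReAlign steps in the Figure 3 caption, together with the centroid gap measured in Figure 4, can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the "trace" is assumed to be the trace of the embedding covariance (the sum of per-dimension variances), scaling is assumed to happen around the target anchor, and the gap is assumed to be the Euclidean distance between centroids.

```python
import numpy as np


def l2_normalize(e, eps=1e-12):
    """Project each row onto the unit sphere."""
    norms = np.linalg.norm(e, axis=1, keepdims=True)
    return e / np.maximum(norms, eps)


def covariance_trace(e):
    """Sum of per-dimension variances (assumed definition of the trace)."""
    return float(e.var(axis=0).sum())


def centroid_gap(e, ref_mean):
    """Euclidean distance between the centroid of `e` and a reference mean."""
    return float(np.linalg.norm(e.mean(axis=0) - ref_mean))


def realign_sketch(e_y, mu_x, trace_x, eps=1e-12):
    """Hypothetical sketch of Anchor, Trace, and Centroid Alignment.

    e_y:     (n, d) source (text) embeddings
    mu_x:    (d,)   target (image) mean, estimated from unpaired data
    trace_x: float  target covariance trace, estimated from unpaired data
    """
    # Step 1 (Anchor): center the source; it is re-anchored at mu_x below.
    centered = e_y - e_y.mean(axis=0)

    # Step 2 (Trace): scale the residuals so their covariance trace equals
    # trace_x, shift to the target anchor, then project onto the sphere.
    scale = np.sqrt(trace_x / max(covariance_trace(e_y), eps))
    e_prime = l2_normalize(mu_x + scale * centered)

    # Step 3 (Centroid): the projection moved the mean to mu_prime;
    # correct it explicitly: e'' = e' - mu' + mu_x.
    mu_prime = e_prime.mean(axis=0)
    e_dprime = e_prime - mu_prime + mu_x

    # Final re-normalization back onto the unit sphere.
    return l2_normalize(e_dprime)


# Synthetic usage example: two clusters pointing along different axes.
rng = np.random.default_rng(0)
d = 16
x = l2_normalize(0.1 * rng.standard_normal((2000, d)) + np.eye(d)[0])
y = l2_normalize(0.3 * rng.standard_normal((2000, d)) + np.eye(d)[1])
mu_x, trace_x = x.mean(axis=0), covariance_trace(x)
y_hat = realign_sketch(y, mu_x, trace_x)
print("gap before:", centroid_gap(y, mu_x))
print("gap after: ", centroid_gap(y_hat, mu_x))
```

Only the target mean and trace are needed, so the target statistics can come from unpaired images. The synthetic clusters above are for illustration only and say nothing about the gaps reported in the paper.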

Theorems & Definitions (19)

  • Lemma B.1: InfoNCE embedding-gradient is a linear combination of the contrastive set
  • proof
  • Corollary B.2: Consistency with Main Text
  • Lemma B.3: Leakage equals $\sin\theta(U_t,U)$
  • proof
  • Proposition B.4: Projected Mean Increment
  • proof
  • Lemma B.5: Central Symmetry Implies Zero Mean
  • Proposition B.6: Phantom Drift Mechanism
  • Remark B.1
  • ...and 9 more
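
The title of Lemma B.5 points to a standard fact. A one-line argument follows, assuming that "central symmetry" means a random vector $\varepsilon$ with finite mean has the same distribution as $-\varepsilon$; the paper's exact hypotheses are not shown in this summary and may differ.

```latex
\varepsilon \overset{d}{=} -\varepsilon
\;\Longrightarrow\;
\mathbb{E}[\varepsilon] = \mathbb{E}[-\varepsilon] = -\mathbb{E}[\varepsilon]
\;\Longrightarrow\;
\mathbb{E}[\varepsilon] = 0 .
```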