
The Geometry of Compromise: Unlocking Generative Capabilities via Controllable Modality Alignment

Hongyuan Liu, Qinli Yang, Wen Li, Zhong Zhang, Jiaming Liu, Wei Han, Zhili Qin, Jinxia Guo, Junming Shao

Abstract

Vision-Language Models (VLMs) such as CLIP learn a shared embedding space for images and text, yet their representations remain geometrically separated, a phenomenon known as the modality gap. This gap limits tasks requiring cross-modal interchangeability, such as captioning and joint clustering. Existing post-processing approaches can partially improve cross-modal compatibility; however, we show through geometric analysis that they primarily reduce the global centroid offset while leaving the underlying distributional mismatch intact. We decompose the modality gap into a Centroid Gap and a Distribution Gap, and demonstrate that the Distribution Gap is the true predictor of cross-modal task quality ($R^2 = 0.986$), whereas the commonly used Raw Gap is misleading ($R^2 = 0.691$). Motivated by this observation, we propose TPC-CMA (Three-Phase Curriculum for Cross-Modal Alignment), a fine-tuning framework that explicitly reduces both components. The proposed CMA jointly mitigates centroid offsets and reshapes the distributional structure, while a three-phase curriculum with gradient-aware scheduling progressively introduces alignment during training to enable stable optimization. Experiments demonstrate that our method significantly improves cross-modal alignment. With $\alpha_{\text{target}}{=}0.05$, the modality gap is reduced by 66.6\% with only a 4.84\% accuracy drop. Under stronger alignment ($\alpha_{\text{target}}{=}0.5$), the gap is reduced by 82.3\%, clustering ARI improves from 0.318 to 0.516, and captioning CIDEr increases by 57.1\% over the original model. Our code and pre-trained models will be made publicly available upon acceptance.
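To make the decomposition concrete, the sketch below shows one way the two components could be measured from paired, L2-normalized CLIP embeddings: the Centroid Gap as the offset between the modality means, and the Distribution Gap as the residual mean cosine distance after centering each modality (matching the "residual cosine distance after centroid alignment" used in Figure 1). The function name and exact formulas here are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def gap_decomposition(img_emb: np.ndarray, txt_emb: np.ndarray):
    """Split the modality gap into a centroid and a distribution component.

    img_emb, txt_emb: (N, D) arrays of paired, L2-normalized embeddings.
    Definitions are illustrative; see the paper for the exact formulation.
    """
    # Centroid Gap: offset between the two modality means.
    mu_img, mu_txt = img_emb.mean(axis=0), txt_emb.mean(axis=0)
    centroid_gap = np.linalg.norm(mu_img - mu_txt)

    # Distribution Gap: residual mismatch after removing the centroid
    # offset. Center each modality, re-normalize, and take the mean
    # cosine distance between paired embeddings.
    def center_and_norm(x):
        x = x - x.mean(axis=0)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    img_c, txt_c = center_and_norm(img_emb), center_and_norm(txt_emb)
    distribution_gap = 1.0 - (img_c * txt_c).sum(axis=1).mean()
    return centroid_gap, distribution_gap
```

Under this reading, Mean-Centering drives the first quantity toward zero while leaving the second (the one correlated with downstream quality) essentially unchanged.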

Figures (6)

  • Figure 1: The t-SNE visualization of multimodal features (top) and downstream performance of different methods (bottom). (a) Original CLIP: image and text embeddings form separated islands. (b) Mean-Centering [liang2022mind; grclip2025; i0t2025; yamashita2025bridging]: centroids overlap but distributional mismatch persists. (c) Our alignment method (TPC-CMA, $\alpha_{\text{target}}{=}0.5$): true semantic interleaving. Each panel reports the Modality Gap, defined as the residual cosine distance after centroid alignment (see Section \ref{sec:gap_decomp}); lower values indicate better structural alignment. Bottom: downstream impact on two zero-shot tasks. (d) Image captioning CIDEr improves from 0.210 to 0.330. (e) Joint clustering ARI improves from 0.318 to 0.516.
  • Figure 2: Overview of TPC-CMA. A: CLIP backbone extracts image and text embeddings. B: CMA Loss combines Negative Reweighting (reduces Centroid Gap) and Intra-modal Geometry Matching (reduces Distribution Gap) via $\alpha$. C: Three-Phase Curriculum schedules $\alpha$ from 0 to $\alpha_{\text{target}}$ across Anchor, Gradient-aware Ramp-up, and Stabilize stages, with the transition speed dynamically modulated by the observed gradient dynamics between loss terms (a minimal, illustrative scheduling sketch follows this figure list).
  • Figure 3: Gap decomposition under Mean-Centering. Despite a 97% Centroid Gap reduction, both Distribution Gap and ROUGE-L remain unchanged, showing that correcting the centroid offset alone does not improve cross-modal compatibility.
  • Figure 4: Gap vs. Accuracy Pareto Frontier. TPC-CMA forms a smooth efficient frontier that envelops all selected baselines. AlignCLIP and M$^2$-Mix fall significantly below the frontier, while CLIP-Refine fails to reduce the gap.
  • Figure 5: Distribution Gap vs. Raw Gap as predictors of DeCap CIDEr score. (a) Distribution Gap is a near-perfect predictor ($R^2 = 0.986$); (b) Raw Gap yields a substantially weaker fit ($R^2 = 0.691$).
  • ...and 1 more figure
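As a rough illustration of the curriculum in Figure 2, the sketch below schedules $\alpha$ across the three phases: held at zero while the contrastive objective anchors the representation, ramped toward $\alpha_{\text{target}}$ at a speed damped whenever the alignment term's gradients dominate the contrastive term's, then held fixed to stabilize. The phase fractions, the gradient-balance rule, and all names are assumptions; the paper's gradient-aware scheduler may differ in detail.

```python
def alpha_schedule(step, total_steps, alpha_target,
                   grad_norm_align, grad_norm_contrastive,
                   anchor_frac=0.2, stabilize_frac=0.2):
    """Three-phase curriculum for the alignment weight alpha (illustrative).

    Phase 1 (Anchor): alpha = 0, train with the contrastive loss only.
    Phase 2 (Ramp-up): increase alpha toward alpha_target, slowing the
        ramp when the alignment gradient dominates the contrastive one.
    Phase 3 (Stabilize): hold alpha at alpha_target.
    """
    anchor_end = int(anchor_frac * total_steps)
    ramp_end = int((1.0 - stabilize_frac) * total_steps)

    if step < anchor_end:      # Phase 1: Anchor
        return 0.0
    if step >= ramp_end:       # Phase 3: Stabilize
        return alpha_target

    # Phase 2: linear ramp, modulated by observed gradient dynamics.
    progress = (step - anchor_end) / (ramp_end - anchor_end)
    # If the alignment term's gradients are much larger than the
    # contrastive term's, damp the ramp to keep optimization stable.
    balance = min(1.0, grad_norm_contrastive / (grad_norm_align + 1e-8))
    return alpha_target * progress * balance
```

A training loop would recompute the two gradient norms periodically and combine the objectives as, e.g., $L = L_{\text{contrastive}} + \alpha \, L_{\text{CMA}}$; the exact combination rule used by TPC-CMA is not specified in this excerpt.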