Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence

Wenzhe Yin, Zehao Xiao, Pan Zhou, Shujian Yu, Jiayi Shen, Jan-Jakob Sonke, Efstratios Gavves

TL;DR

The paper targets the persistent modality gap and the alignment–uniformity conflict in InfoNCE-based vision–language alignment. It introduces CS-Aligner, which couples mutual information with the Cauchy–Schwarz divergence, estimated nonparametrically via kernel density estimation, to align both the global distributions of each modality and their cross-modal semantics. The framework extends to unpaired data and token-level alignment, and adopts parameter-efficient adapters/LoRA to keep the approach scalable. Empirical results on text-to-image generation and image-text retrieval demonstrate tighter, more robust alignment and improved generation quality, with the ability to leverage unpaired data to further boost performance. Overall, CS-Aligner offers a distribution-aware, scalable path for robust multimodal alignment across generation and retrieval tasks, with potential extension to broader modalities and diffusion-based models.
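
For reference, the Cauchy–Schwarz divergence mentioned above has a standard closed form, and replacing the densities with Gaussian kernel density estimates gives a simple sample-based estimator. The block below is a sketch of those textbook expressions, not a transcription of the paper's equations: the symbols $\kappa_\sigma$ (Gaussian kernel with bandwidth $\sigma$), $M$, $N$, $x_i$, and $y_j$ are notation assumed here for illustration.

```latex
% Cauchy-Schwarz divergence between densities p and q
D_{\mathrm{CS}}(p; q)
  = -\log \frac{\left( \int p(z)\, q(z)\, dz \right)^{2}}
               {\int p(z)^{2}\, dz \; \int q(z)^{2}\, dz}

% Plug-in estimator from samples \{x_i\}_{i=1}^{M} \sim p and \{y_j\}_{j=1}^{N} \sim q
\widehat{D}_{\mathrm{CS}}(p; q)
  = \log\!\Big( \tfrac{1}{M^{2}} \sum_{i,j=1}^{M} \kappa_\sigma(x_i, x_j) \Big)
  + \log\!\Big( \tfrac{1}{N^{2}} \sum_{i,j=1}^{N} \kappa_\sigma(y_i, y_j) \Big)
  - 2 \log\!\Big( \tfrac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \kappa_\sigma(x_i, y_j) \Big)
```

By the Cauchy–Schwarz inequality the divergence is non-negative and equals zero when $p = q$. The estimator uses only within-set and cross-set kernel sums, so it does not require $M = N$ or any pairing between $x_i$ and $y_j$, which is consistent with the summary's point that the method can use unpaired data.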

Abstract

Multimodal alignment is crucial for various downstream tasks such as cross-modal generation and retrieval. Previous multimodal approaches like CLIP utilize InfoNCE to maximize mutual information, primarily aligning pairwise samples across modalities while overlooking distributional differences. In addition, InfoNCE has an inherent conflict between alignment and uniformity in the multimodal setting, leading to suboptimal alignment with modality gaps. To overcome these limitations, we propose CS-Aligner, a novel framework that performs distributional vision-language alignment by integrating Cauchy-Schwarz (CS) divergence with mutual information. CS-Aligner captures both the global distribution information of each modality and the pairwise semantic relationships. We find that the CS divergence seamlessly addresses InfoNCE's alignment-uniformity conflict and plays a complementary role to InfoNCE, yielding tighter and more precise alignment. Moreover, by introducing distributional alignment, CS-Aligner enables incorporating additional information from unpaired data and token-level representations, enhancing flexible and fine-grained alignment in practice. Experiments on text-to-image generation and cross-modality retrieval tasks demonstrate the effectiveness of our method on vision-language alignment.
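
To make the idea of pairing a pairwise InfoNCE term with a distribution-level CS divergence term concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian kernel with bandwidth `sigma`, the weight `lam`, the temperature value, and the simple additive combination are all hypothetical choices, and the paper's actual objective may differ.

```python
import numpy as np


def gaussian_gram(a, b, sigma):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def cs_divergence(x, y, sigma=1.0):
    """KDE plug-in estimate of the Cauchy-Schwarz divergence between two sample sets.

    x and y may contain different numbers of samples; no pairing is needed.
    """
    kxx = gaussian_gram(x, x, sigma).mean()
    kyy = gaussian_gram(y, y, sigma).mean()
    kxy = gaussian_gram(x, y, sigma).mean()
    return np.log(kxx) + np.log(kyy) - 2.0 * np.log(kxy)


def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE over paired rows of x and y (row i of x matches row i of y)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    logits = x @ y.T / temperature

    def cross_entropy_diag(l):
        # Numerically stable log-softmax per row; the targets are the diagonal entries.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


def combined_loss(x, y, lam=1.0, sigma=1.0, temperature=0.07):
    """Hypothetical combined objective: InfoNCE plus a weighted CS divergence term."""
    return info_nce(x, y, temperature) + lam * cs_divergence(x, y, sigma)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(16, 8))
    text_emb = image_emb + 0.5 + 0.1 * rng.normal(size=(16, 8))  # offset mimics a modality gap
    print("InfoNCE:", info_nce(image_emb, text_emb))
    print("CS divergence:", cs_divergence(image_emb, text_emb))
    print("Combined:", combined_loss(image_emb, text_emb))
```

In this sketch the InfoNCE term only sees matched pairs, while the CS term compares the two embedding clouds as a whole, so `cs_divergence` can also be evaluated on extra unpaired image or text embeddings of different batch sizes.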

Paper Structure

This paper contains 36 sections, 26 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: InfoNCE
  • Figure 2: With CS-Aligner
  • Figure 3: Image-Text pair distance.
  • Figure 5: Toy examples: mutual information (MI $\uparrow$) and distribution divergence ($\downarrow$) between two distributions. Distributions with the same high mutual information value can exhibit either large (a) or small (b) distributional distances, demonstrating that MI alone is insufficient for multimodal alignment. Moreover, distribution divergence measures the closeness between distributions but does not guarantee that the underlying random variables are statistically correlated (c).
  • Figure 6: Illustration of CS-Aligner. We achieve vision-language alignment by freezing the pretrained text and image encoders and applying parameter-efficient fine-tuning methods (e.g., adapter) with our CS-Aligner. CS-Aligner optimizes the adapters using the aggregated CS divergence and InfoNCE, as formulated in Eq. (\ref{eq:finalobjective}). Once aligned, the adapters are utilized for various cross-modality tasks: the aligned text adapter facilitates text-to-image generation without additional modifications, while the aligned multimodal adapters are used for vision-language retrieval.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Remark 2.1
  • Remark 3.1
  • Remark 3.2
  • Example B.1
  • Remark D.1
  • Remark D.2