Topological Alignment of Shared Vision-Language Embedding Space

Junwon You, Dasol Kang, Jae-Hun Jung

TL;DR

The paper tackles English-dominant cross-lingual alignment in multilingual contrastive vision-language models by introducing ToMCLIP, a topology-aware training framework. It adds topology-preserving objectives, notably a topological alignment loss $L_{ta}$ based on persistence-diagram distances and a distance-matrix loss $L_{dm}$, alongside the standard pointwise loss $L_{pw}$, with scalable persistence-diagram approximations via MST-based sparsification. Empirical results show ToMCLIP improves zero-shot CIFAR-100 accuracy across 13 languages and enhances multilingual retrieval on xFlickr&CO, validating that preserving global embedding-space topology yields robust cross-lingual and cross-modal structure. The approach generalizes beyond VLMs, offering a principled method to incorporate topological alignment into broader representation-learning tasks.
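The contrast between the pointwise term $L_{pw}$ and the distance-matrix term $L_{dm}$ can be illustrated with a small pure-Python sketch. The exact forms of both losses are not given in this summary, so the squared-error definitions below, the helper names, and the weight `lambda_dm` are assumptions chosen for illustration; the topological term $L_{ta}$ is left out entirely.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix of a list of equal-length vectors."""
    return [[math.dist(p, q) for q in points] for p in points]

def pointwise_loss(student, teacher):
    """Batch mean of squared Euclidean distances between matched embeddings.
    Assumed stand-in for L_pw."""
    return sum(math.dist(s, t) ** 2 for s, t in zip(student, teacher)) / len(student)

def distance_matrix_loss(student, teacher):
    """Mean squared difference between the two pairwise distance matrices.
    Assumed stand-in for L_dm."""
    ds, dt = pairwise_distances(student), pairwise_distances(teacher)
    n = len(student)
    return sum((ds[i][j] - dt[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

def total_loss(student, teacher, lambda_dm=1.0):
    """Weighted sum of the two terms; lambda_dm is a hypothetical weight."""
    return pointwise_loss(student, teacher) + lambda_dm * distance_matrix_loss(student, teacher)

# Toy batch: the student embeddings are the teacher embeddings shifted by (0.5, 0.5).
teacher = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
student = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5)]
print(pointwise_loss(student, teacher))        # approximately 0.5
print(distance_matrix_loss(student, teacher))  # 0.0
```

In this toy batch the pointwise term is nonzero while the distance-matrix term vanishes, because a translation leaves all pairwise distances unchanged. The two terms therefore measure different things: matching of individual embeddings versus agreement of the batch geometry.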

Abstract

Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have narrowed this gap, but they enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework that aligns embedding spaces under topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagrams, with theoretical error bounds, using a graph sparsification strategy. We validate the approach empirically, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.
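As background for the persistent-homology component, the sketch below relies on a standard fact: the finite death times of the 0-dimensional Vietoris-Rips persistence diagram of a point cloud are exactly the edge weights of a minimum spanning tree of its distance graph. This is a minimal illustration using Kruskal's algorithm, not the paper's sparsification procedure; the function name and the toy distance matrix are assumptions.

```python
def zero_dim_death_times(dist):
    """Finite H0 death times of the Rips filtration on a distance matrix.

    Runs Kruskal's algorithm with a union-find structure; each accepted
    MST edge merges two components, and its weight is one death time.
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Find the component root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All undirected edges, sorted by weight.
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # edge joins two components: one component dies at w
            parent[ri] = rj
            deaths.append(w)
    return deaths

# Three points on a line at positions 0, 1 and 3.
D = [[0, 1, 3],
     [1, 0, 2],
     [3, 2, 0]]
print(zero_dim_death_times(D))  # [1, 2]
```

A cloud of $N$ points yields $N-1$ finite death times, plus one component that never dies.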

Paper Structure

This paper contains 43 sections, 2 theorems, 22 equations, 5 figures, 19 tables.

Key Result

Theorem 1

Let $0 \le \epsilon \le 1$ and $G_\epsilon = (V,E,\omega_\epsilon)$. Let $m(\epsilon) \coloneqq \#\bigl\{ (0,d) \in D_0^{\mathrm{Rips}}(G) \mid \epsilon < d < \infty \bigr\}$, i.e., the number of finite $0$-dimensional persistence points of $G$ whose death times exceed $\epsilon$. Then, and $0 \leq m(\epsilon) \leq N-1$ where $W_p$ denotes the $p$-Wasserstein distance.
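The quantity $m(\epsilon)$ in the statement above is a simple count, which the following sketch makes concrete. It assumes the finite 0-dimensional death times of $G$ have already been computed and are passed in as a list; the function name and the sample values are hypothetical.

```python
import math

def count_long_lived_components(death_times, eps):
    """m(eps): number of finite 0-dimensional death times d with eps < d < infinity."""
    return sum(1 for d in death_times if eps < d < math.inf)

# Hypothetical death times for a graph with N = 5 vertices (N - 1 = 4 finite points).
deaths = [0.1, 0.4, 0.4, 0.9]
print(count_long_lived_components(deaths, 0.0))  # 4
print(count_long_lived_components(deaths, 0.3))  # 3
print(count_long_lived_components(deaths, 1.0))  # 0
```

The count is non-increasing in $\epsilon$ and stays within $0 \leq m(\epsilon) \leq N-1$, since a graph on $N$ vertices has at most $N-1$ finite 0-dimensional persistence points.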

Figures (5)

  • Figure 1: Visualization of text embeddings (English and Korean) in the latent space using t-SNE (maaten2008visualizing), from CLIP and multilingual CLIP (MCLIP; carlsson2022cross) text encoders. The Fashion Product Images dataset (param_aggarwal_2019) was used, where the productDisplayName field serves as the input caption to the text encoders. Colors indicate the corresponding masterCategory of each product.
  • Figure 2: Overview of the proposed alignment framework between CLIP ($E_T$) and multilingual CLIP (MCLIP; $E_S$) text encoders. $E_S$ is trained to align with the frozen $E_T$ using a combination of loss functions; $L_{\text{pw}}$ enforces point-wise alignment; $L_{\text{ta}}$ and $L_{\text{dm}}$ promote geometric alignment by preserving topological structures. The evaluation is conducted by pairing $E_S$ with the pretrained CLIP image encoder, enabling cross-lingual retrieval in the shared embedding space.
  • Figure 3: Sorted pairwise distance curves of English (En) vs. Korean (Ko) embeddings.
  • Figure 4: Two-dimensional t-SNE projections of English and Korean text embeddings.
  • Figure 5: Ablation study on batch size in the low-resource setting.

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 1
  • proof