Topological Alignment of Shared Vision-Language Embedding Space
Junwon You, Dasol Kang, Jae-Hun Jung
TL;DR
The paper addresses the English-dominant bias in the cross-lingual alignment of multilingual contrastive vision-language models by introducing ToMCLIP, a topology-aware training framework. Alongside the standard pointwise loss $L_{pw}$, it adds topology-preserving objectives, notably a topological alignment loss $L_{ta}$ based on distances between persistence diagrams and a distance-matrix loss $L_{dm}$. To keep training scalable, persistence diagrams are approximated via MST-based graph sparsification. Empirically, ToMCLIP improves zero-shot CIFAR-100 accuracy across 13 languages and multilingual retrieval on xFlickr&CO, supporting the claim that preserving the global topology of the embedding space yields more robust cross-lingual and cross-modal structure. The authors also present the approach as a general way to incorporate topological alignment into representation learning beyond VLMs.
Abstract
Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English because multilingual multimodal data are limited. Recent multilingual extensions have narrowed this gap, but they enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework that aligns embedding spaces under topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates the persistence diagram, with theoretical error bounds, using a graph sparsification strategy. We validate the proposed approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO. Beyond VLMs, the proposed approach provides a general method for incorporating topological alignment into representation learning.
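The summary above does not spell out the loss definitions, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes Euclidean distances, takes the distance-matrix loss to be a mean squared difference between the two pairwise distance matrices, and uses a simple 0-dimensional proxy for the topological alignment loss: for a Vietoris-Rips filtration parameterized by pairwise distance, the finite death times of the 0-dimensional persistence diagram equal the edge weights of a minimum spanning tree (MST), so the sketch compares sorted MST edge weights. All function names are hypothetical, and the code uses only the Python standard library.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length vectors."""
    return [[math.dist(p, q) for q in points] for p in points]


def distance_matrix_loss(d_a, d_b):
    """Mean squared difference of two distance matrices (assumed form of L_dm)."""
    n = len(d_a)
    total = sum((d_a[i][j] - d_b[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)


def h0_death_times(dist):
    """Sorted MST edge weights, computed with Prim's algorithm.

    With the Rips filtration indexed by pairwise distance, these are the
    finite death times of the 0-dimensional persistence diagram.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each vertex
    best[0] = 0.0
    deaths = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:  # the starting vertex contributes no edge
            deaths.append(best[u])
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return sorted(deaths)


def h0_alignment_loss(d_a, d_b):
    """Mean absolute gap between sorted H0 death times (a proxy, not L_ta itself)."""
    a, b = h0_death_times(d_a), h0_death_times(d_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy data: the "student" embeddings are the "teacher" embeddings scaled by 2.
teacher = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
student = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
d_teacher = pairwise_distances(teacher)
d_student = pairwise_distances(student)

print(h0_death_times(d_teacher))
print(distance_matrix_loss(d_teacher, d_student))
print(h0_alignment_loss(d_teacher, d_student))
```

In the toy data, uniform scaling by 2 doubles every MST edge weight, so both losses are nonzero even though each point keeps the same nearest neighbours; identical inputs give zero for both. A full topological loss would use a proper diagram distance and possibly higher-dimensional homology, which this sketch does not attempt.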
