
ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham

TL;DR

The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.

Abstract

Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate the modality gap. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and a reduced modality gap. The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.
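To make the headline metric concrete, the following is a minimal sketch of how Recall@K and its average over several cut-offs could be computed from a query-by-candidate similarity matrix. The function names, the choice of cut-offs, and the single-ground-truth simplification are assumptions made for illustration; the abstract does not state which K values or retrieval directions enter the reported average.

```python
def recall_at_k(similarity, gt_index, k):
    """Fraction of queries whose ground-truth item is among the top-k scores.

    similarity[q][c] is the score between query q and candidate c;
    gt_index[q] is the index of the single correct candidate for query q
    (a simplification: real benchmarks may have several captions per image).
    """
    hits = 0
    for scores, gt in zip(similarity, gt_index):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if gt in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def mean_recall(similarity, gt_index, ks=(1, 5, 10)):
    """Average of Recall@K over the chosen cut-offs (ks is an assumption)."""
    return sum(recall_at_k(similarity, gt_index, k) for k in ks) / len(ks)


# Toy example: 3 text queries scored against 3 images.
toy_similarity = [
    [0.9, 0.1, 0.2],  # query 0 ranks its own image first
    [0.8, 0.3, 0.1],  # query 1 ranks its own image second
    [0.2, 0.7, 0.4],  # query 2 ranks its own image second
]
toy_gt = [0, 1, 2]
r1 = recall_at_k(toy_similarity, toy_gt, 1)
r2 = recall_at_k(toy_similarity, toy_gt, 2)
avg = mean_recall(toy_similarity, toy_gt, ks=(1, 2))
```

In the toy example only the first query retrieves its image at rank 1, while all three succeed within the top 2, so the metric rises with K as expected.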

Paper Structure

This paper contains 53 sections, 24 equations, 9 figures, and 11 tables.

Figures (9)

  • Figure 1: ViCLIP-OT architecture overview. The model consists of a DINOv3-based image encoder and a Vietnamese Sentence-BERT text encoder that project images and texts into a shared embedding space. The hybrid training objective combines a CLIP-style contrastive loss with the proposed SIGROT loss, which uses a similarity graph and optimal transport to enforce global cross-modal alignment.
  • Figure 2: R@K comparison on UIT-OpenViIC for text-to-image (left) and image-to-text (right) retrieval tasks. Incorporating the SIGROT loss consistently improves performance over both CLIP and SigLIP baselines across all R@K metrics.
  • Figure 3: UMAP visualization of image and text embeddings on the UIT-OpenViIC test set. Each subplot corresponds to a different training objective. Circles represent image embeddings and triangles represent text embeddings, with colors indicating pseudo labels obtained via K-Means clustering ($k=20$). SIGROT-based methods exhibit tighter cross-modal clustering compared to baselines.
  • Figure 4: GradCAM visualization comparing the baseline SigLIP and the proposed ViSigLIP-OT on the UIT-OpenViIC test set. Each row shows the original image alongside the GradCAM heatmaps from both models for a given Vietnamese text query. In the first two rows, ViSigLIP-OT focuses more precisely on the query-relevant objects (the girl wearing an Ao dai and the man holding apples in his hands), while SigLIP spreads activations over background regions. In the third row, SigLIP correctly attends to the man standing next to a car, whereas ViSigLIP-OT highlights irrelevant background areas.
  • Figure 5: Effect of (a) the number of final image-encoder groups left unfrozen when training ViCLIP-OT, and (b) the hybrid loss weight $\lambda$, where $\lambda=0$ corresponds to SIGROT only. Performance peaks at 13 unfrozen groups (69.62%), and at $\lambda=0.2$ for ViCLIP-OT and $\lambda=0.1$ for ViSigLIP-OT.
  • ...and 4 more figures
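
The hybrid objective described in the captions for Figures 1 and 5 can be illustrated with a small sketch. It implements a CLIP-style symmetric contrastive loss over a batch similarity matrix and mixes it with a second term through a weight `lam`. Two points are assumptions rather than details taken from the paper: the convex mixing rule `lam * contrastive + (1 - lam) * sigrot` is only one form consistent with the statement that $\lambda=0$ corresponds to SIGROT only, and the SIGROT term itself is not implemented here, so it is passed in as a placeholder number. The temperature value is likewise hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a square image-text similarity matrix.

    sim[i][j] is the cosine similarity of image i and text j; matched
    pairs sit on the diagonal. The temperature value is an assumption.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    t2i = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (i2t + t2i)


def hybrid_loss(contrastive, sigrot, lam):
    """Assumed mixing rule: lam = 0 keeps only the SIGROT term."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * contrastive + (1.0 - lam) * sigrot


# Toy batch of two image-text pairs with mostly correct alignment.
sim = [[0.9, 0.1], [0.2, 0.8]]
l_clip = clip_contrastive_loss(sim)
placeholder_sigrot = 0.5  # hypothetical value; SIGROT is not implemented here
total = hybrid_loss(l_clip, placeholder_sigrot, lam=0.2)
```

With `lam=0.2`, the value reported as best for ViCLIP-OT in Figure 5, the placeholder term carries most of the weight under this assumed rule, and the contrastive term contributes the remainder.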