uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data

Dahyun Chung, Donghyun Shin, Yujin Sung, Seunggi Moon, Jinwoo Jeon, Byung-Jun Lee

TL;DR

This work extends vision-language models to low-resource languages without requiring image–text or text–text paired data. It introduces uCLIP, a pivot-based approach that freezes both the image encoder and the multilingual text encoder and trains only a compact 1.7M-parameter projection module, using English as a semantic anchor and memory-based soft retrieval to align multilingual text with images. The method combines inter- and intra-alignment losses with embedding perturbations to produce robust cross-modal, cross-lingual representations, achieving strong zero-shot retrieval and classification performance with significantly lower training cost and faster inference than translation-based pipelines. Empirical results on multilingual retrieval benchmarks and zero-shot classification show substantial gains for five underrepresented languages and confirm the model’s efficiency and transferability across backbones. The approach offers a practical path toward inclusive multimodal learning in multilingual settings with modest computational resources.

Abstract

Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image-text data. Existing multilingual vision-language models exhibit consistently low retrieval performance in underrepresented languages including Czech, Finnish, Croatian, Hungarian, and Romanian on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English representations as semantic anchors. This minimal training setup enables robust multilingual alignment even for languages with limited supervision. Extensive evaluation across multiple multilingual retrieval benchmarks confirms the effectiveness of our method, showing significant gains in five underrepresented languages where existing models typically underperform. These findings highlight the effectiveness of our pivot-based, parameter-efficient alignment strategy for inclusive multimodal learning.
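
To make the idea of a contrastive loss with anchors concrete, here is a minimal sketch of a generic symmetric InfoNCE loss in plain Python. It is not the paper's objective: the function names, the temperature of 0.07, and the toy vectors are assumptions for illustration, and the alignment terms and soft retrieval used by uCLIP are not reproduced. The sketch only shows the basic behavior: the loss is small when each projected embedding is closest to its own anchor and large otherwise.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(queries, anchors, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to anchors[i] among all anchors.

    The temperature value is an assumed default, not taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    a = [l2_normalize(v) for v in anchors]
    total = 0.0
    for i, query in enumerate(q):
        logits = [sum(x * y for x, y in zip(query, anchor)) / temperature
                  for anchor in a]
        # Numerically stable log-sum-exp over all candidate anchors.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_info_nce(projected, anchors, temperature=0.07):
    """Average the query-to-anchor and anchor-to-query losses."""
    return 0.5 * (info_nce(projected, anchors, temperature)
                  + info_nce(anchors, projected, temperature))


# Toy 2-D embeddings (made up): the "aligned" batch matches its anchors,
# the "swapped" batch pairs every embedding with the wrong anchor.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_info_nce(aligned, anchors)
loss_swapped = symmetric_info_nce(swapped, anchors)
print(loss_aligned < loss_swapped)
```

In a training setup like the one described, only the projection that produces the projected embeddings would receive gradients, while the encoders that produce the raw embeddings stay frozen.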

Paper Structure

This paper contains 39 sections, 12 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Performance Comparison. We compare models by plotting the average image-to-text Recall@10 across five underrepresented languages in the XM3600 benchmark against the number of trainable parameters (in millions). Each marker shape indicates the type of supervision used during training: circle for models trained without image-text (multilingual or English) pairs (I-T) or multilingual-English text pairs (T-T), square for those trained with T-T pairs only, and triangle for models using both I-T and T-T pairs. Despite having only 1.7M parameters and no paired supervision, uCLIP achieves the highest average Recall@10, outperforming all baselines.
  • Figure 2: Average z-score of Recall@10 across languages. We evaluate multilingual VLM performance on the XM3600 benchmark using Recall@10 from four baseline models: AltCLIP-18, SigLIP2, NLLB-CLIP, and M-CLIP. For each model, we compute a z-score per language, indicating how much that language's Recall@10 deviates from the model-wise mean. The final score is the average z-score across models. Languages highlighted in red are the five low-resource languages we target (cs, fi, hr, hu, ro). Languages not supported by our multilingual text encoder (e.g., bn, fil, mi, quz, sw, te) are excluded from our evaluation.
  • Figure 3: Overall architecture. We propose a lightweight alignment framework that bridges multilingual text and image embeddings via English, without requiring paired I-T or T-T data or encoder finetuning. uCLIP employs frozen encoders along with compact projection heads to map inputs into a shared embedding space. At inference time, only the multilingual text encoder, the image encoder, and the projectors are used. The model directly encodes multilingual text and image inputs using the frozen encoders, then projects them into the shared space.
  • Figure 4: Cosine similarity visualization for embeddings of text and image queries. We visualize the cosine similarity matrices for five models: (a) AltCLIP-18, (b) SigLIP2, (c) NLLB-CLIP, (d) M-CLIP, and (e) our proposed uCLIP model. All text samples come from the Flickr30k benchmark and are translated into each of the five languages.
  • Figure 5: UMAP visualization of image embeddings. We visualize embeddings extracted from (a) the CLIP image encoder, (b) the uCLIP model with the CLIP image encoder, (c) the SigLIP2 image encoder, and (d) the uCLIP model with the SigLIP2 image encoder. Results are based on CIFAR-10.
  • ...and 3 more figures
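
The averaging procedure described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name, the use of the population standard deviation, and the Recall@10 numbers (which are made up, with placeholder model names) are all assumptions.

```python
from statistics import mean, pstdev


def average_z_scores(recall_by_model):
    """Map {model: {language: recall}} to {language: mean z-score across models}.

    Each z-score is computed within one model, relative to that model's
    mean and (population) standard deviation over its languages.
    """
    z_by_language = {}
    for scores in recall_by_model.values():
        values = list(scores.values())
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population std, not sample std
        for language, value in scores.items():
            z = (value - mu) / sigma if sigma > 0 else 0.0
            z_by_language.setdefault(language, []).append(z)
    return {language: mean(zs) for language, zs in z_by_language.items()}


# Made-up Recall@10 values for two placeholder models (not results from the paper).
toy_recall = {
    "model_a": {"en": 90.0, "cs": 70.0, "fi": 50.0},
    "model_b": {"en": 80.0, "cs": 60.0, "fi": 40.0},
}

toy_z = average_z_scores(toy_recall)
for language, z in sorted(toy_z.items(), key=lambda item: item[1]):
    print(f"{language}: {z:+.2f}")
```

Because each z-score is standardized within a model, a language with a negative average is one that most models handle worse than their own typical language, regardless of how strong each model is overall.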