uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data

Dahyun Chung; Donghyun Shin; Yujin Sung; Seunggi Moon; Jinwoo Jeon; Byung-Jun Lee

TL;DR

This work extends vision-language models to low-resource languages without requiring image–text or text–text paired data. It introduces uCLIP, a pivot-based approach that freezes both the image encoder and the multilingual text encoder and trains only a compact 1.7M-parameter projection module, using English as a semantic anchor and memory-based soft retrieval to align multilingual text with images. The method combines inter- and intra-alignment losses with embedding perturbations to produce robust cross-modal, cross-lingual representations, and it achieves strong zero-shot retrieval and classification performance with significantly lower training cost and faster inference than translation-based pipelines. Results on multilingual retrieval benchmarks and zero-shot classification show substantial gains for five underrepresented languages (Czech, Finnish, Croatian, Hungarian, and Romanian) and confirm the model’s efficiency and transferability across backbones. The approach offers a practical path toward inclusive multimodal learning in multilingual settings with modest computational resources.

Abstract

Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image-text data. Existing multilingual vision-language models exhibit consistently low retrieval performance in underrepresented languages including Czech, Finnish, Croatian, Hungarian, and Romanian on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English representations as semantic anchors. This minimal training setup enables robust multilingual alignment even for languages with limited supervision. Extensive evaluation across multiple multilingual retrieval benchmarks confirms the effectiveness of our method, showing significant gains in five underrepresented languages where existing models typically underperform. These findings highlight the effectiveness of our pivot-based, parameter-efficient alignment strategy for inclusive multimodal learning.
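To make the phrase "a contrastive loss over English representations as semantic anchors" concrete, the following is a minimal sketch of a generic symmetric InfoNCE loss, written in plain Python. It is not the paper's implementation: the function names, the temperature of 0.07, and the toy vectors are all assumptions chosen for illustration, and the sketch omits the projection module, the memory-based soft retrieval, and the embedding perturbations described above. It only shows the kind of objective that pulls each projected embedding toward its matching English anchor and pushes it away from the others.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, anchors, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to anchors[i] among all anchors.

    The temperature value is an assumed default, not taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    a = [l2_normalize(v) for v in anchors]
    total = 0.0
    for i, qi in enumerate(q):
        # Similarity of query i to every anchor, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(qi, aj)) / temperature for aj in a]
        # Numerically stable log-sum-exp over the logits.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_info_nce(projected, anchors, temperature=0.07):
    """Average the loss in both directions (projected -> anchor, anchor -> projected)."""
    forward = info_nce(projected, anchors, temperature)
    backward = info_nce(anchors, projected, temperature)
    return 0.5 * (forward + backward)


# Toy example: three hypothetical "English anchor" embeddings.
english_anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Projected embeddings that line up with their anchors give a small loss.
well_aligned = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 0.8]]
# The same embeddings paired with the wrong anchors give a large loss.
misaligned = [well_aligned[1], well_aligned[2], well_aligned[0]]

loss_aligned = symmetric_info_nce(well_aligned, english_anchors)
loss_misaligned = symmetric_info_nce(misaligned, english_anchors)
```

In a setup like the one the abstract describes, only the parameters that produce the projected embeddings would receive gradients from such a loss, while the image encoder and multilingual text encoder stay frozen.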
