Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

Bang Yang, Yong Dai, Xuxin Cheng, Yaowei Li, Asif Raza, Yuexian Zou

TL;DR

This work tackles language inclusivity in vision-language models by proposing continual language learning (CLL) to extend CLIP's multilingual capacity without retraining the full model. It introduces CLL-CLIP, which freezes the vision and text encoders and adds an expandable token embedding layer, and TEIR, an initialization-and-regularization strategy for token embeddings that mitigates catastrophic forgetting. The authors validate on a 36-language benchmark built from MSCOCO and XM3600, showing that TEIR yields consistent gains for CLL-CLIP and for state-of-the-art methods on multilingual image-text retrieval, including a 6.7% improvement in text-to-image average Recall@1 on XM3600. The approach offers a practical path to deploying VL-PTMs across diverse languages and reduces reliance on costly joint training or translation pipelines.

Abstract

While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their mastery in a few languages like English restricts their applicability in broader communities. To this end, there is an increasing interest in developing multilingual VL models via a joint-learning setup, which, however, could be unrealistic due to expensive costs and data availability. In this work, we propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF). We begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP, a prevailing VL-PTM that has acquired image-English text alignment. Specifically, CLL-CLIP contains an expandable token embedding layer to handle linguistic differences. It solely trains token embeddings to improve memory stability and is optimized under cross-modal and cross-lingual objectives to learn the alignment between images and multilingual texts. To alleviate CF raised by covariate shift and lexical overlap, we further propose a novel approach that ensures the identical distribution of all token embeddings during initialization and regularizes token embedding learning during training. We construct a CLL benchmark covering 36 languages based on MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance. Extensive experiments verify the effectiveness of CLL-CLIP and show that our approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on XM3600, and improve various state-of-the-art methods consistently. Our code and data are available at https://github.com/yangbang18/CLFM.
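The abstract describes an expandable token embedding layer whose new entries follow the same distribution as the existing ones. The sketch below illustrates one possible reading of that idea, assuming a single Gaussian fitted to all values of the current table; the function name `expand_embeddings`, the pooled-Gaussian choice, and the toy data are assumptions made for illustration and are not taken from the authors' released code.

```python
import random
import statistics


def expand_embeddings(table, num_new, seed=0):
    """Return a new table with `num_new` rows appended.

    Assumption: new rows are sampled from a Gaussian whose mean and
    standard deviation are estimated from all values of the existing
    table, so old and new embeddings share the same distribution.
    Existing rows are kept unchanged and in their original order.
    """
    rng = random.Random(seed)
    values = [v for row in table for v in row]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    dim = len(table[0])
    new_rows = [[rng.gauss(mean, std) for _ in range(dim)]
                for _ in range(num_new)]
    return table + new_rows, mean, std


# Toy 3-token vocabulary with 2-dimensional embeddings (hypothetical data).
old_table = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.2]]
expanded, mean, std = expand_embeddings(old_table, num_new=4)
print(len(expanded), "rows; fitted mean/std:", mean, std)
```

In a continual setting, a call like this would run once per newly added language, with `num_new` equal to the number of tokens that the new language introduces beyond the current vocabulary.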

Paper Structure (30 sections, 6 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: For continual language learning, prior works in NLP (garcia2021continual; huang2022EVS) train full model parameters to learn a new language, with new token embeddings initialized randomly without considering the distribution of prior ones. Our work requires the least amount of components to be trained (i.e., the token embedding layer) and targets token embedding initialization and regularization to avert catastrophic forgetting. Note that our frozen vision PTM is not plotted for clarity.
  • Figure 2: Overview of our proposals. (a): CLL-CLIP builds upon a two-tower VL-PTM (i.e., CLIP), keeps all pre-trained components frozen, and contains an expandable and trainable token embedding layer for continual language learning. (b): Our TEIR approach eases catastrophic forgetting by underscoring the initialization and regularization of token embeddings.
  • Figure 3: Convergence analysis for the different settings in the TEIR ablation table (tab:ablation_TEIR), focusing on (left) the training loss and (right) the Fisher eigenvalues. Lower values indicate, respectively, convergence closer to global minima and convergence to flatter minima.
  • Figure 4: Analysis of CLL-CLIP's core designs: (left) trainable components and (right) training objectives.
  • Figure 5: Translate-test performance on Hebrew data in XM3600. Although translate-test CLIP is a strong pipeline system, our model can process foreign texts directly ($\eta = 0$) or achieve better retrieval performance via score fusion when translations are available ($\eta > 0$).
  • ...and 3 more figures
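
The Figure 5 caption mentions score fusion controlled by $\eta$, where $\eta = 0$ means the model scores the foreign text directly. The caption does not give the fusion formula, so the sketch below assumes a simple convex combination of the two similarity scores; the names `fuse_scores` and `top1` and the toy scores are hypothetical.

```python
def fuse_scores(direct, translated, eta):
    """Blend per-image similarity scores for one text query.

    Assumption (not stated in the source): fusion is the convex
    combination (1 - eta) * direct + eta * translated, so eta = 0
    uses only the scores from processing the foreign text directly.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return [(1.0 - eta) * d + eta * t for d, t in zip(direct, translated)]


def top1(scores):
    """Index of the highest-scoring image (the Recall@1 candidate)."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical similarities of one query against two candidate images.
direct = [0.2, 0.8]      # model applied to the original-language text
translated = [0.9, 0.1]  # translate-test pipeline on the English translation
for eta in (0.0, 0.5, 1.0):
    fused = fuse_scores(direct, translated, eta)
    print(eta, fused, "top-1 image:", top1(fused))
```

Under this assumption, sweeping `eta` on held-out data would show whether available translations add retrieval signal beyond direct processing.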