Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models

Young Kyun Jang; Ser-nam Lim

TL;DR

XBT addresses the backfilling problem in cross-modal vision-language retrieval by making a stronger new VLP model backward compatible with the embeddings of a frozen old one, so the gallery does not need to be re-embedded. The method uses a projection module, pretrained only on text, that maps new embeddings into the old embedding space, and then applies parameter-efficient training that updates only lightweight components. Training has two stages, text-only pretraining followed by image-text supervised learning, and requires far fewer image-text pairs than building a VLP from scratch. Experiments on nocaps, Flickr, and COCO demonstrate robust cross-modal backward compatibility and backfill-free upgrades, with strong zero-shot and transfer behavior.

Abstract

Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.
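The text-only pretraining idea in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the projection is a plain linear map, the objective is squared error, the "old" and "new" text embeddings are synthetic vectors related by a hidden linear map, and the names `fit_projection`, `recall_at_1`, and `TRUE_MAP` are hypothetical. The paper's actual module and loss may differ. The sketch fits the projection on paired text embeddings only, then checks that projected new queries retrieve the matching item from an old gallery that was never re-embedded.

```python
import math
import random

random.seed(0)
D_NEW, D_OLD = 6, 4          # toy embedding sizes (hypothetical)
N_TRAIN, N_TEST = 200, 50    # toy text-set sizes (hypothetical)

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Stand-in for the two frozen text encoders: the "old" embedding is a hidden
# linear function of the "new" one. Real VLP models are not related this simply.
TRUE_MAP = [[random.gauss(0, 1) for _ in range(D_NEW)] for _ in range(D_OLD)]

def sample_pairs(n):
    """Return n synthetic (new, old) text-embedding pairs."""
    new = [[random.gauss(0, 1) for _ in range(D_NEW)] for _ in range(n)]
    old = [matvec(TRUE_MAP, x) for x in new]
    return new, old

def mse(W, xs, ys):
    """Mean squared error between projected new embeddings and old ones."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = matvec(W, x)
        total += sum((a - b) ** 2 for a, b in zip(p, y))
    return total / len(xs)

def fit_projection(xs, ys, lr=0.05, epochs=30):
    """Fit phi(x) = W x by per-sample gradient descent on squared error."""
    W = [[0.0] * D_NEW for _ in range(D_OLD)]
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = matvec(W, x)
            for i in range(D_OLD):
                err = p[i] - y[i]
                for j in range(D_NEW):
                    W[i][j] -= lr * err * x[j]
    return W

def recall_at_1(queries, gallery):
    """Fraction of queries whose most similar gallery item shares their index."""
    hits = 0
    for i, q in enumerate(queries):
        best = max(range(len(gallery)), key=lambda j: cosine(q, gallery[j]))
        hits += best == i
    return hits / len(queries)

train_new, train_old = sample_pairs(N_TRAIN)
test_new, test_old = sample_pairs(N_TEST)

zero_W = [[0.0] * D_NEW for _ in range(D_OLD)]
loss_before = mse(zero_W, train_new, train_old)
W = fit_projection(train_new, train_old)
loss_after = mse(W, train_new, train_old)

# New-model queries are projected into the old space; the old gallery is untouched.
projected = [matvec(W, x) for x in test_new]
recall = recall_at_1(projected, test_old)
print(loss_before, loss_after, recall)
```

The toy is text-to-text only. In the setting the abstract describes, the same pretrained projection would be applied to the new model's image and text embeddings during the later image-text stage, which this sketch does not attempt to reproduce.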

Paper Structure (21 sections, 6 equations, 5 figures, 7 tables)

Introduction
Related Works
Backward-compatible Training.
Vision-Language Continual and Transfer Learning
Method
Criterion for Cross-modal Backward Compatibility
Text-only Pretraining
Cross-modal Backward-compatible Training
Experiments
Setup
Model Training.
Model Evaluation.
Implementation Details.
Main Results
Further Analysis
...and 6 more sections

Figures (5)

Figure 1: A conceptual visualization of Backward-compatible Training (BT, above) and its extension, the Cross(X)-modal version (XBT, below). Circles and squares denote data samples of images and text, respectively. XBT uses Vision-Language Pretraining (VLP) models as baselines to achieve cross-modal, backward-compatible representation learning, allowing the new, improved model to be compatible with the fixed old model.
Figure 2: An illustration of the text-only pretraining of $\phi$ (above) and XBT with $\phi$ (below). Using only text samples, $\phi$ is trained to approximate the distribution of old text embeddings from that of new ones. After pretraining, the same $\phi$ generates synthetic old image and text embeddings from the new VLP embeddings, and these are used to learn a cross-modal backward-compatible representation.
Figure 3: Our proposed learning process to achieve XBT. Notably, the old VLP model's encoders are not required in this stage, enhancing efficiency in training.
Figure 4: New query vs. old gallery retrieval results on nocaps, with B32 as the old model and L14 as the new model.
Figure 5: A t-SNE visualization of 5,000 paired image-text embeddings from the COCO dataset, using two different CLIP models and two different BLIP models. Five pairs are marked as examples. The image and text samples form distinct distributions in each VLP space.
