Linear Alignment of Vision-language Models for Image Captioning

Fabian Paischer, Markus Hofmarcher, Sepp Hochreiter, Thomas Adler

TL;DR

The paper addresses the misalignment between image and text embeddings in CLIP and introduces a fast, closed-form linear alignment: an orthogonal Procrustes mapping $\mathbf{W}$ that re-aligns the joint CLIP space for downstream image captioning and caption evaluation. It then builds a lightweight retrieval-augmented captioning pipeline, ReCap, which uses $\mathbf{W}$ to retrieve captions from a datastore and conditions a language model on them to generate new captions. Training is fast because computing $\mathbf{W}$ requires only $O(d^3)$ computation. Additionally, the authors propose two learning-based metrics, aCLIP-S and RefaCLIP-S, that use the aligned CLIP space and correlate more strongly with human judgments than prior CLIP-based metrics. Across MS-COCO, Flickr30k, VizWiz, and MSRVTT, ReCap shows competitive or superior performance with far less training effort, and the new metrics agree better with human evaluation, enabling more reliable captioning research and applications.
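
The closed-form alignment mentioned above can be illustrated with a minimal NumPy sketch of the standard orthogonal Procrustes solution. This is not the authors' code: the function name `procrustes_alignment`, the mapping direction (image embeddings mapped toward text embeddings), and the synthetic data are all assumptions made for illustration. The SVD of the $d \times d$ cross-covariance matrix is the $O(d^3)$ step.

```python
import numpy as np

def procrustes_alignment(img_emb, txt_emb):
    """Return the orthogonal W minimizing ||img_emb @ W - txt_emb||_F.

    Hypothetical helper: both inputs are (n, d) arrays of paired embeddings.
    """
    # Cross-covariance between the two modalities, shape (d, d).
    m = img_emb.T @ txt_emb
    # SVD of a (d, d) matrix: the O(d^3) part of the computation.
    u, _, vt = np.linalg.svd(m)
    # Dropping the singular values yields the closest orthogonal matrix.
    return u @ vt

# Synthetic stand-ins for CLIP embeddings (assumption: real embeddings
# would come from the CLIP image and text encoders).
rng = np.random.default_rng(0)
d = 8
img = rng.normal(size=(100, d))
q, _ = np.linalg.qr(rng.normal(size=(d, d)))       # hidden orthogonal map
txt = img @ q + 0.01 * rng.normal(size=(100, d))   # "text" = mapped image + noise

W = procrustes_alignment(img, txt)
print("orthogonal:", np.allclose(W.T @ W, np.eye(d)))
print("residual before:", np.linalg.norm(img - txt))
print("residual after: ", np.linalg.norm(img @ W - txt))
```

SciPy ships the same solution as `scipy.linalg.orthogonal_procrustes(A, B)`, which returns the orthogonal matrix together with a scale term; the explicit version here only makes the SVD step visible.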

Abstract

Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches leverage CLIP for cross-modal retrieval to condition pre-trained language models on visual input. However, CLIP generally suffers from a misalignment of image and text modalities in the joint embedding space. We investigate efficient methods to linearly re-align the joint embedding space for the downstream task of image captioning. This leads to an efficient training protocol that merely requires computing a closed-form solution for a linear mapping in the joint CLIP space. Consequently, we propose a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. Moreover, we propose two new learning-based image-captioning metrics built on CLIP score along with our proposed alignment. We evaluate ReCap on MS-COCO, Flickr30k, VizWiz and MSRVTT. On the former two, ReCap performs comparably to state-of-the-art lightweight methods on rule-based metrics while outperforming them on most of the CLIP-based metrics. On the latter two benchmarks, ReCap consistently outperforms competitors across all metrics and exhibits strong transfer capabilities and resilience to noise. Finally, we demonstrate that our proposed metrics correlate more strongly with human judgement than existing metrics on the Flickr8k-Expert, Flickr8k-Crowdflower, and THumB datasets.
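
The retrieval step that conditions the language model can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: `retrieve_captions`, `build_prompt`, the prompt wording, the value of `k`, and the toy one-hot embeddings are all hypothetical, and the call to the language model itself is omitted.

```python
import numpy as np

def retrieve_captions(img_emb, W, caption_embs, captions, k=2):
    """Return the k datastore captions closest to the aligned image embedding."""
    # Map the image embedding with the (assumed) alignment matrix W.
    query = img_emb @ W
    query = query / np.linalg.norm(query)
    # Normalize datastore keys so the dot product is cosine similarity.
    keys = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    scores = keys @ query
    top = np.argsort(-scores)[:k]
    return [captions[i] for i in top]

def build_prompt(retrieved):
    """Hypothetical prompt template; the paper's exact wording may differ."""
    context = " ".join(c.rstrip(".") + "." for c in retrieved)
    return f"Similar images show: {context} This image shows:"

# Toy datastore with one-hot "embeddings" (assumption: real entries would
# be CLIP text embeddings of training captions).
captions = [
    "a dog runs on grass",
    "a dog plays with a ball",
    "a plate of pasta",
    "a city street at night",
]
caption_embs = np.eye(4)
W = np.eye(4)                               # identity stands in for a learned W
img_emb = np.array([0.9, 0.1, 0.0, 0.0])    # toy image embedding

retrieved = retrieve_captions(img_emb, W, caption_embs, captions, k=2)
prompt = build_prompt(retrieved)
print(prompt)
```

The resulting string would then be passed to a text-to-text model to generate the new caption; the retrieval itself needs no training beyond computing the mapping.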

Paper Structure (46 sections, 8 equations, 8 figures, 23 tables)

Figures (8)

  • Figure 1: (a) We train a linear mapping $\mathbf{W}$ to align the image and text embeddings of CLIP toward a dataset. (b) At inference, we employ the mapping to retrieve captions from a datastore that are similar to the input image and provide these along with a prompt to a FLAN-T5 model to generate a new caption.
  • Figure 2: t-SNE visualization of CLIP embeddings before (left) and after (right) linear re-alignment on the Flickr30k dataset.
  • Figure 3: Pearson correlation between commonly used image captioning metrics for captions generated via ReCap on the MS-COCO test set.
  • Figure 4: Development of CIDEr-D, SPICE, aCLIP-S, and RefaCLIP-S for DAL on the MS-COCO validation set where we use RefaCLIP-S for quality filtering.
  • Figure 5: Development of the hyperparameter $k$ and the number of synthetic captions per image during DAL on the MS-COCO dataset.
  • ...and 3 more figures