RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning

Yunchuan Ma, Laiyun Qing, Guorong Li, Yuankai Qi, Amin Beheshti, Quan Z. Sheng, Qingming Huang

TL;DR

RETTA tackles zero-shot video captioning by fusing a video-text retriever (XCLIP), an image-text matcher (CLIP), a text alignment model (AnglE), and a language generator (GPT-2) through learnable tokens that are updated at test time. The framework uses retrieval-guided losses and vision-language alignment losses to adapt the frozen components without ground-truth data, followed by a CommonGen-based sentence-cleaning pass. Inference runs for 16 iterations per video, producing multiple captions and selecting the most video-relevant one, achieving state-of-the-art CIDEr scores on MSR-VTT, MSVD, and VATEX among zero-shot methods. The results demonstrate effective cross-modal knowledge integration and practical, fast adaptation for zero-shot video captioning, with potential for offline large-scale retrieval and cross-domain extension.
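The final selection step, picking the most video-relevant caption among several candidates, can be illustrated with a minimal sketch. The summary does not say which model scores relevance or how, so this assumes cosine similarity between a video embedding and each caption embedding. The function names (`cosine`, `select_caption`) and the toy embedding vectors are hypothetical stand-ins for the outputs of a real video-text encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_caption(video_emb, candidates):
    """Return the caption whose embedding is closest to the video embedding.

    candidates: list of (caption_text, caption_embedding) pairs.
    Assumption: relevance is measured by cosine similarity.
    """
    return max(candidates, key=lambda c: cosine(video_emb, c[1]))[0]

# Toy embeddings (hypothetical; a real system would obtain them from an encoder).
video_emb = [0.9, 0.1, 0.0]
candidates = [
    ("a man is cooking in a kitchen", [0.8, 0.2, 0.1]),
    ("a dog runs on the grass", [0.0, 0.1, 0.9]),
    ("someone is talking", [0.3, 0.9, 0.1]),
]
best = select_caption(video_emb, candidates)
print(best)  # a man is cooking in a kitchen
```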

Abstract

Despite the significant progress of fully supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which takes advantage of existing pretrained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using four key models, chosen for their source-code availability: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2. The main challenge is how to make the text generation model sufficiently aware of the content of a given video so that it generates corresponding captions. To address this problem, we propose using learnable tokens as a communication medium among the four frozen models GPT-2, XCLIP, CLIP, and AnglE. Unlike the conventional approach of training these tokens on training data, we propose to learn them from soft targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information tailored to GPT-2. This procedure completes in just a few iterations (we use 16 iterations in the experiments) and does not require ground-truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show absolute 5.1%-32.4% improvements in terms of the main metric CIDEr compared to several state-of-the-art zero-shot video captioning methods.
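The idea of learning tokens against soft targets at inference time, while every model stays frozen, can be sketched with a toy example. This is not the paper's implementation. It assumes a single next-token distribution, models the learnable tokens as additive offsets on the frozen logits, and uses one cross-entropy loss where the paper combines several. The names `adapt_tokens`, `frozen_logits`, and `soft_target`, and the learning rate, are hypothetical. Only the 16 iterations come from the abstract.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs):
    """Cross-entropy between a soft target distribution and predictions."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def adapt_tokens(frozen_logits, soft_target, steps=16, lr=1.0):
    """Toy test-time adaptation: only `tokens` changes, the logits stay frozen.

    Assumption: `tokens` are additive logit offsets, a stand-in for the
    learnable tokens described in the abstract.
    """
    tokens = [0.0] * len(frozen_logits)
    losses = []
    for _ in range(steps):
        probs = softmax([z + t for z, t in zip(frozen_logits, tokens)])
        losses.append(cross_entropy(soft_target, probs))
        # For softmax + cross-entropy with a target summing to 1,
        # the gradient w.r.t. each offset is (probs_i - target_i).
        grad = [p - q for p, q in zip(probs, soft_target)]
        tokens = [t - lr * g for t, g in zip(tokens, grad)]
    return tokens, losses

frozen_logits = [2.0, 0.5, 0.1, -1.0]   # what the frozen generator prefers
soft_target = [0.1, 0.6, 0.25, 0.05]    # hypothetical video-aware soft target
tokens, losses = adapt_tokens(frozen_logits, soft_target)
print(f"loss before: {losses[0]:.3f}, after 16 steps: {losses[-1]:.3f}")
```

No ground-truth caption appears anywhere in the loop: the only supervision is the soft target, which in the paper's setting would be derived from the frozen models' own outputs on the test video.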

Paper Structure (14 sections, 10 equations, 7 figures, 10 tables)

This paper contains 14 sections, 10 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: The traditional retrieval-augmented generation paradigm performs well on pure natural-language tasks, such as text-based question answering. However, it performs poorly when directly applied to multi-modality tasks, such as video captioning.
  • Figure 2: The workflow of the proposed method, which consists of frozen pre-trained foundation vision and language models as well as trainable tokens. Unlike conventional soft prompt training, we update these tokens directly during inference with soft targets. In this way, the frozen foundation models can quickly adapt to the video captioning task for zero-shot application. $\mathcal{L}_{language}$ is computed between the token probability distributions generated with (i.e., writing) and without (i.e., reading) the trainable soft prompt. $\mathcal{L}_{vision}$ calculates a matching score between the currently generated text and the visual information, using CLIP on keyframes. $\mathcal{L}_{retrieval}^S$ measures a matching score between the currently generated text and the retrieved sentences using AnglE, and $\mathcal{L}_{retrieval}^W$ focuses on their high-frequency words. These optimizations take place at each step of the autoregressive process.
  • Figure 3: Our sentence-cleaning post-processing strategy extracts key elements and uses them to reconstruct a clean caption with the pre-trained CommonGen model.
  • Figure 4: Illustration of our CLIP-based sampling strategy. Selected frames are highlighted in red, while filtered (redundant) frames are marked in green.
  • Figure 5: Performance comparison on the VATEX dataset under different settings: (a)–(b) retrieved sentence count, (c)–(d) soft token count, (e)–(f) frequency size, and (g)–(h) candidate word count.
  • ...and 2 more figures
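
The frame filtering shown in Figure 4 (keep selected frames, drop redundant ones) can be approximated by a greedy similarity filter. The caption does not state the actual selection rule, so this sketch assumes a frame is redundant when its cosine similarity to any already-kept frame reaches a threshold. `sample_keyframes`, the threshold value, and the 2-D toy embeddings are hypothetical. In practice the embeddings would come from CLIP's image encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def sample_keyframes(frame_embs, threshold=0.9):
    """Greedily keep frames that are not too similar to any kept frame.

    Returns the indices of the kept frames in temporal order.
    Assumption: redundancy means cosine similarity >= threshold.
    """
    kept = []
    for i, emb in enumerate(frame_embs):
        if all(cosine(emb, frame_embs[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy per-frame embeddings (hypothetical stand-ins for CLIP image features).
frame_embs = [
    [1.0, 0.0],    # frame 0: kept (first frame)
    [0.99, 0.05],  # frame 1: near-duplicate of frame 0, filtered
    [0.0, 1.0],    # frame 2: new content, kept
    [0.1, 0.98],   # frame 3: near-duplicate of frame 2, filtered
    [0.7, 0.7],    # frame 4: distinct from both, kept
]
kept = sample_keyframes(frame_embs)
print(kept)  # [0, 2, 4]
```

A lower threshold keeps fewer frames and reduces the number of CLIP comparisons per generation step. A higher one preserves more visual detail at higher cost.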