The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning

Mingkai Tian, Guorong Li, Yuankai Qi, Amin Beheshti, Javen Qinfeng Shi, Anton van den Hengel, Qingming Huang

TL;DR

This work tackles zero-shot video captioning by explicitly modeling scene content through a progressive, multi-granularity prompting framework. It builds three textual memory banks (noun phrases, enhanced scene graphs, and entire captions) and uses category-aware retrieval with top-p filtering to generate diverse, contextually rich prompts that guide a lightweight decoder. The approach achieves state-of-the-art zero-shot results on MSR-VTT, MSVD, and VATEX in the in-domain setting and generalizes well across domains; extensive ablations validate the contribution of each memory bank and of the retrieval strategy. The findings highlight the value of explicit, structured textual representations for visual grounding and suggest that the approach could scale with larger multimodal models.

Abstract

Zero-shot video captioning requires that a model generate high-quality captions without human-annotated video-text pairs for training. State-of-the-art approaches to the problem leverage CLIP to extract visually relevant textual prompts that guide language models in generating captions. These methods tend to focus on one key aspect of the scene and build a caption that ignores the rest of the visual input. To address this issue and generate more accurate and complete captions, we propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences. Moreover, we introduce a category-aware retrieval mechanism that models the distribution of natural language surrounding the specific topics in question. Extensive experiments demonstrate the effectiveness of our method, with improvements of 5.7%, 16.2%, and 3.4% in the main metric, CIDEr, on the MSR-VTT, MSVD, and VATEX benchmarks, respectively, over the existing state of the art.

Paper Structure

This paper contains 26 sections, 6 equations, 6 figures, 9 tables, 2 algorithms.

Figures (6)

  • Figure 1: Current zero-shot captioning methods are easily distracted. NP, SG, and EC denote the noun phrase, scene graph (triplets are displayed as concatenated strings), and entire caption prompts, respectively. "Res" indicates the generated captions, with correct and incorrect words highlighted. In the top example, MultiCapCLIP fails to capture the rider-motorcycle interaction because the prior does not model the distribution of subject-object interactions effectively. In the bottom example, MultiCapCLIP's top-K retrieval strategy produces repetitive, similar noun phrases and lacks person and environment information, because it models neither the structure of the scene nor that of natural language. DeCap struggles to fully understand the video details because its prompt, a global caption embedding, is coarse-grained. By contrast, our method generates more accurate and comprehensive descriptions.
  • Figure 2: We construct the noun phrase memory bank $\mathcal{M}_{\text{NP}}$ and the scene graph memory bank $\mathcal{M}_{\text{SG}}$ by parsing training captions, selecting high-frequency elements, and enhancing scene graphs with noun phrases to include more attribute information. The entire caption memory bank $\mathcal{M}_{\text{EC}}$ contains all training captions. During training, following MultiCapCLIP, we retrieve the top-K elements from the memory banks using the perturbed embedding $\tilde{\mathbf{e}}$ and train a text decoder to reconstruct the original text. During inference, we first classify $\mathcal{M}_{\text{NP}}$ with GPT-4 and $\mathcal{M}_{\text{SG}}$ based on noun phrase categories, and compute statistical priors. We then retrieve a diverse set of relevant noun phrase and scene graph elements using CLIP embeddings with top-p filtering, and generate a weighted embedding from $\mathcal{M}_{\text{EC}}$ using softmax similarity scores between video and caption features. The three types of prompt are transformed by their respective FFNs and concatenated to generate the final caption.
  • Figure 3: Comparison of captions generated by our method and by other state-of-the-art methods. We highlight the important ground-truth words and the accurate words in our generated descriptions.
  • Figure 4: Distribution of Noun Phrases across Different Categories in the MSR-VTT, MSVD, and VATEX Datasets.
  • Figure 5: Impact of number of selected elements from noun phrase and scene graph memory banks on in-domain CIDEr scores. NP: Noun Phrase, SG: Scene Graph.
  • ...and 1 more figures
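
The inference-time retrieval summarized in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the temperature parameter, and the choice to apply top-p filtering independently within each category are assumptions made for illustration, and the statistical priors mentioned in the caption are omitted because their form is not given here. The similarity scores stand in for CLIP video-text similarities, which would be computed elsewhere.

```python
import math
from collections import defaultdict


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(items, scores, p=0.9, temperature=1.0):
    """Keep the smallest set of top-ranked items whose cumulative
    softmax probability reaches p (nucleus-style filtering)."""
    probs = softmax(scores, temperature)
    ranked = sorted(zip(items, probs), key=lambda pair: pair[1], reverse=True)
    selected, cumulative = [], 0.0
    for item, prob in ranked:
        selected.append(item)
        cumulative += prob
        if cumulative >= p:
            break
    return selected


def category_aware_top_p(items, scores, categories, p=0.9):
    """Assumed reading of category-aware retrieval: group memory-bank
    elements by category and run top-p filtering inside each group."""
    groups = defaultdict(lambda: ([], []))
    for item, score, category in zip(items, scores, categories):
        groups[category][0].append(item)
        groups[category][1].append(score)
    return {
        category: top_p_select(group_items, group_scores, p)
        for category, (group_items, group_scores) in groups.items()
    }


def weighted_embedding(embeddings, scores, temperature=1.0):
    """Softmax-weighted average of caption embeddings, standing in for
    the weighted embedding drawn from the entire caption memory bank."""
    weights = softmax(scores, temperature)
    dim = len(embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, embeddings))
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Hypothetical noun phrases, similarity scores, and categories.
    phrases = ["a man", "a rider", "a motorcycle", "a dirt road"]
    sims = [2.0, 1.0, 3.0, 0.5]
    cats = ["person", "object", "object", "scene"]
    print(category_aware_top_p(phrases, sims, cats, p=0.7))
    print(weighted_embedding([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
```

Filtering within each category is one simple way to keep person, object, and scene information all represented in the prompt, which a single global top-K ranking does not guarantee.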