Set Prediction Guided by Semantic Concepts for Diverse Video Captioning

Yifan Lu, Ziqi Zhang, Chunfeng Yuan, Peng Li, Yan Wang, Bing Li, Weiming Hu

TL;DR

This work tackles the challenge of generating diverse, high-quality video captions by reframing diverse captioning as a set prediction problem. It introduces Semantic-Concept-Guided Set Prediction (SCG-SP), which builds semantics-specific encodings from temporal video features and concept queries, and decodes a set of captions each paired with a concept combination. A set-level loss with Hungarian matching, along with a diversity-promoting term and an auxiliary concept-classification task, drives intra-set and inter-set reasoning and improves interpretability. Experiments on MSVD, MSRVTT, and VATEX show state-of-the-art performance in both relevance and diversity, with ablations and qualitative analyses demonstrating the value of semantic guidance and set-based optimization.

Abstract

Diverse video captioning aims to generate a set of sentences that describe a given video from various aspects. Mainstream methods are trained on independent pairs of a video and a caption from its ground-truth set without exploiting the intra-set relationship, which results in low diversity among the generated captions. In contrast, we formulate diverse captioning as a semantic-concept-guided set prediction (SCG-SP) problem by fitting the predicted caption set to the ground-truth set, so that the set-level relationship is fully captured. Specifically, our set prediction consists of two synergistic tasks: caption generation and an auxiliary task of concept combination prediction that provides extra semantic supervision. Each caption in the set is attached to a concept combination that indicates the primary semantic content of the caption and facilitates element alignment in set prediction. Furthermore, we apply a diversity regularization term on concepts to encourage the model to generate semantically diverse captions with various concept combinations. The two tasks share multiple semantics-specific encodings as input, which are obtained by iterative interaction between visual features and conceptual queries. The correspondence between the generated captions and specific concept combinations further guarantees the interpretability of our model. Extensive experiments on benchmark datasets show that the proposed SCG-SP achieves state-of-the-art (SOTA) performance under both relevance and diversity metrics.
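
The element alignment between a predicted set and a ground-truth set can be illustrated with a small, hedged sketch. It is not the authors' implementation: the matching cost used here (one minus the Jaccard similarity of two concept sets) is an assumption chosen for readability, and an exhaustive search over permutations stands in for the Hungarian algorithm, which finds the same minimum-cost one-to-one assignment far more efficiently. The names `jaccard_cost` and `match_sets` and the toy concept sets are hypothetical.

```python
from itertools import permutations


def jaccard_cost(pred, gt):
    """Assumed matching cost: 1 - Jaccard similarity of two concept sets."""
    union = pred | gt
    if not union:
        return 0.0
    return 1.0 - len(pred & gt) / len(union)


def match_sets(pred_concepts, gt_concepts):
    """Find the minimum-cost one-to-one matching between two equally sized sets.

    Returns (assignment, total_cost), where assignment[i] is the index of the
    ground-truth element matched to prediction i. Brute force is used here
    only because the toy sets are tiny; it is a stand-in for Hungarian matching.
    """
    n = len(pred_concepts)
    assert n == len(gt_concepts), "this sketch assumes equally sized sets"
    cost = [[jaccard_cost(p, g) for g in gt_concepts] for p in pred_concepts]

    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Toy data (hypothetical): concept combinations attached to three captions.
pred = [{"man", "guitar"}, {"dog", "run"}, {"woman", "cook"}]
gt = [{"woman", "cook", "kitchen"}, {"man", "play", "guitar"}, {"dog", "run"}]

assignment, total = match_sets(pred, gt)
print(assignment, round(total, 3))
```

Once the elements are matched, a set-level loss would be summed over the matched pairs only, so each prediction is compared against the ground-truth element it aligns with best rather than against an arbitrary one.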

Paper Structure (39 sections, 7 equations, 9 figures, 13 tables)

This paper contains 39 sections, 7 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Difference between (a) existing CVAE/control-based diverse captioning methods and (b) our proposed SCG-SP. There is no direct interaction among generated captions in CVAE-based or control-based methods, where the loss is calculated with independent training samples. Our proposed SCG-SP generates captions based on multiple semantics-specific visual encodings with sufficient interaction and is trained by a set-level prediction loss to exploit the set-level relationship.
  • Figure 2: Overview of the proposed SCG-SP. Based on pre-extracted video frame features, we first employ a temporal encoder, a concept detector, and a concept-driven encoder to obtain multiple semantics-specific encodings for the input video. In the parallel decoding stage, we apply a caption head and a classification head to decode each encoding into a caption sentence and a concept combination label, respectively, which together form the prediction set. By performing element matching between the predicted set and the ground-truth set, the set prediction loss is calculated over matched element pairs. Note that the ground-truth concept combination labels are assigned by taking the high-frequency nouns and verbs from the captions.
  • Figure 3: Illustration of (a) the lightweight LSTM captioner and (b) the prefix-GPT captioner.
  • Figure 4: Distribution of semantics-specific encodings. Different colors stand for encodings of different videos. We take the same part from the distribution maps of Base and w/ $L_{div}$ for clear comparison. Best viewed in color.
  • Figure 5: An example of captions generated by SCG-SP-Prefix on MSRVTT. Concepts that appear in both the predicted combinations and the captions are highlighted in red. Best viewed in color.
  • ...and 4 more figures
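
The Figure 2 caption notes that ground-truth concept combination labels come from high-frequency nouns and verbs in the captions. The following is a rough sketch of that idea under stated assumptions, not the authors' pipeline: a small stopword filter stands in for part-of-speech tagging (so it does not truly restrict the vocabulary to nouns and verbs), and `STOPWORDS`, `build_vocab`, `concept_label`, and `top_k` are hypothetical names introduced for illustration.

```python
from collections import Counter

# Assumption: a stopword filter approximates "keep nouns and verbs".
# A real pipeline would use a part-of-speech tagger instead.
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "and", "with"}


def build_vocab(all_captions, top_k):
    """Pick the top_k most frequent non-stopword tokens as the concept vocabulary."""
    counts = Counter(
        word
        for caption in all_captions
        for word in caption.lower().split()
        if word not in STOPWORDS
    )
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def concept_label(caption, vocab):
    """Multi-hot label: 1 where a vocabulary concept occurs in the caption."""
    words = set(caption.lower().split())
    return [1 if concept in words else 0 for concept in vocab]


# Toy captions (hypothetical) for a single video.
captions = [
    "a man is playing a guitar",
    "a man sings on a stage",
    "the man is playing music",
]

vocab = build_vocab(captions, top_k=3)
labels = [concept_label(c, vocab) for c in captions]
print(vocab)
print(labels)
```

Each caption thus receives its own multi-hot vector over the concept vocabulary, which is the kind of target a classification head could be trained against.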