GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning

Eileen Wang, Caren Han, Josiah Poon

TL;DR

A novel multimodal integrated caption generation framework for VPC that leverages information from various modalities and external knowledge bases, with a node selection module that enhances decoding efficiency by selecting the most relevant nodes from the graphs.

Abstract

Video Paragraph Captioning (VPC) aims to generate paragraph captions that summarise key events within a video. Despite recent advancements, challenges persist, notably in effectively utilising the multimodal signals inherent in videos and in addressing the long-tail distribution of words. We introduce a novel multimodal integrated caption generation framework for VPC that leverages information from various modalities and external knowledge bases. Our framework constructs two graphs: a 'video-specific' temporal graph capturing major events and interactions between multimodal information and commonsense knowledge, and a 'theme graph' representing correlations between words of a specific theme. These graphs serve as input for a transformer network with a shared encoder-decoder architecture. We also introduce a node selection module to enhance decoding efficiency by selecting the most relevant nodes from the graphs. Our results demonstrate superior performance across benchmark datasets.
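
The node selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how relevance is scored, so this example assumes cosine similarity between a video representation and each node embedding, followed by a top-k cut. The names `select_nodes`, `video_repr`, `node_embeddings` and `k` are hypothetical and are not taken from the paper.

```python
import math


def select_nodes(video_repr, node_embeddings, k):
    """Return the indices of the k nodes most similar to the video representation.

    Assumption: relevance is scored with cosine similarity. The scoring
    function actually used by the framework is not given in this section.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    scores = [cosine(video_repr, node) for node in node_embeddings]
    # Rank node indices from most to least similar and keep the first k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-dimensional embeddings (illustrative values only).
video_repr = [1.0, 0.0]
node_embeddings = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
selected = select_nodes(video_repr, node_embeddings, k=2)
```

Passing only the selected nodes to the decoder shortens the node sequence it has to process, which is one plausible reading of how such a module could improve decoding efficiency.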

Paper Structure

This paper contains 31 sections, 13 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Architecture of GEM-VPC. At time $t$, the entire video-specific graph (VG) and the theme graph (TG) corresponding to the action at time $t$ are fed into separate Graph Neural Networks. In the visual stream, visual features summed with positional embeddings (PE) and token type embeddings (TE) are input into a Recurrent Transformer, and the resulting sequence representation ($H_{\text{v-CLS}}$) is used to select nodes from VG and TG in the node selection module. The selected nodes plus TE are fed into another Recurrent Transformer in the node stream. Cross-attention is applied between the visual and node streams, and the cross-attended features are finally fed into an MLP to predict the next word.
  • Figure 2: Sum of $n$-gram metrics on ActivityNet (ae-val+ae-test) (left) and YouCook2 (yc2-val) (right) across samples with different numbers of events.
  • Figure 3: Qualitative example from ActivityNet. Blue words in the machine-generated captions are visually grounded in the video, while red words are irrelevant words 'hallucinated' by the model.
  • Figure 4: Sum of BLEU-4, METEOR, CIDEr and ROUGE-L scores for the ActivityNet predicted captions across the different video categories using 3 different input modalities (visual only, visual + nodes, visual + nodes + audio). The scores are obtained from the combined validation (ae-val) and testing set (ae-test).
  • Figure 5: Visual example of a sub-graph for the theme graph corresponding to the ActivityNet action class carving pumpkins.
  • ...and 4 more figures