A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

Vladimir Iashin, Esa Rahtu

TL;DR

The paper tackles dense video captioning by incorporating both audio and visual cues through a Bi-modal Transformer, featuring a bi-modal encoder–decoder for captioning and a bi-modal multi-headed proposal generator for dense event proposals. It leverages pre-extracted I3D visual features and VGGish audio features, with GloVe-based word embeddings, and introduces modality-specific attention mechanisms and segment-length priors derived via K-means. The approach achieves state-of-the-art results on ActivityNet Captions in learned-proposals settings and demonstrates the value of cross-modal attention through comprehensive ablations, while maintaining a unified training procedure. The work highlights the practical impact of effectively fusing audio and visual information for dense video understanding and provides insights into design choices for multi-modal sequence-to-sequence tasks.
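The segment-length priors mentioned above can be illustrated with a one-dimensional K-means over event durations. The following is a minimal sketch, not the authors' implementation: the function `kmeans_1d`, its deterministic initialization, and the toy durations are all assumptions made for illustration.

```python
def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's algorithm on scalars; returns the centroids in ascending order."""
    data = sorted(values)
    # Deterministic init (an illustrative choice): spread centroids over the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            # Assign each length to its nearest centroid.
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Recompute each centroid as the cluster mean; keep it if the cluster is empty.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


# Hypothetical event durations in seconds (toy values, not taken from the dataset).
segment_lengths = [4.0, 5.0, 6.0, 28.0, 30.0, 32.0, 118.0, 120.0, 122.0]
priors = kmeans_1d(segment_lengths, k=3)
print(priors)
```

Each resulting centroid would serve as one prior segment length that a proposal head can refine, instead of regressing durations from scratch.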

Abstract

Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they show poor results or demonstrate the importance on a dataset with a specific domain. In this paper, we introduce Bi-modal Transformer which generalizes the Transformer architecture for a bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task. We also show that the pre-trained bi-modal encoder as a part of the bi-modal transformer can be used as a feature extractor for a simple proposal generation module. The performance is demonstrated on a challenging ActivityNet Captions dataset where our model achieves outstanding performance. The code is available: v-iashin.github.io/bmt
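One way to picture a Transformer block that accepts two modalities is scaled dot-product attention in which one stream supplies the queries and the other supplies the keys and values. The sketch below is a simplified illustration under stated assumptions, not the paper's architecture: it uses a single head, no learned projections, and toy features that already share one dimension. The names `cross_attention`, `audio`, and `visual` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key rows."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value rows, one output feature at a time.
        attended.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return attended


# Toy features (assumed): 2 audio steps and 3 visual steps, both of dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

audio_attended = cross_attention(audio, visual, visual)   # audio queries, visual keys/values
visual_attended = cross_attention(visual, audio, audio)   # visual queries, audio keys/values
print(len(audio_attended), len(visual_attended))
```

The output length follows the query stream, so the two modalities may have different sequence lengths and each keeps its own length after attending to the other.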

Paper Structure

This paper contains 37 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Example video with the predictions of our model alongside the ground truth.
  • Figure 2: The design of the Bi-modal Transformer with the Multi-headed Proposal Generator. The model takes as input features extracted by pre-trained VGGish, I3D, and GloVe models (bottom left). The bi-modal encoder with N layers then processes the audio and visual features and passes its bi-modal representation to the proposal generator (top). The generated proposals are used to clip the input features (left), and the clipped features are passed through the encoder again. The encoder output is then used at each of the N layers of the bi-modal decoder (bottom). The decoder attends to the bi-modal encoder's representation as well as to the previous caption words and produces its internal representation of the context. This representation is passed to the generator (right) to generate the next word. Residual connections are omitted for clarity. Best viewed in color.
  • Figure 3: The Bi-modal Multi-headed Proposal Generator takes the two-stream output of the bi-modal encoder and processes it with two stacks of proposal generation heads. The predictions from all heads form a common pool of $T_v \cdot K_v \cdot |\Psi_v| + T_a \cdot K_a \cdot |\Psi_a|$ proposals, which are sorted by confidence score and passed back to clip the input features for the captioning module.
  • Figure 4: The performance comparison between different modalities (Audio-only, Visual-only, and Bi-modal) in two settings (ground truth and learned proposals) across different YouTube video categories. The video categories are sorted according to the performance of the Audio-Visual model in the learned proposal setup. The number of videos in a category is shown in brackets. ActivityNet Captions validation subset is used for the comparison.
  • Figure 5: Qualitative analysis of a video from the ActivityNet Captions validation set. The predictions of our bi-modal model are compared to the uni-modal model predictions and the ground truth (GT) annotations. The video (YouTube video id EIibo7aTpys) shows a man explaining how to perform a martial arts movement.
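
The proposal count in the Figure 3 caption can be checked with simple arithmetic. The sketch below assumes one reading of the symbols, which this excerpt does not define: T as the sequence length of a modality, K as the number of proposal heads in its stack, and |Ψ| as the number of anchors per position. All numeric values and the `(start, end, confidence)` tuples are hypothetical.

```python
def proposal_pool_size(t_v, k_v, psi_v, t_a, k_a, psi_a):
    """Pool size T_v * K_v * |Psi_v| + T_a * K_a * |Psi_a| from the Figure 3 caption."""
    return t_v * k_v * psi_v + t_a * k_a * psi_a


def top_proposals(proposals, n):
    """Sort (start, end, confidence) tuples by confidence, highest first, and keep n."""
    return sorted(proposals, key=lambda p: p[2], reverse=True)[:n]


# Hypothetical sizes: 10 visual steps, 2 heads, 3 anchors; 20 audio steps, 2 heads, 4 anchors.
size = proposal_pool_size(t_v=10, k_v=2, psi_v=3, t_a=20, k_a=2, psi_a=4)
print(size)

# Hypothetical pooled predictions from both streams: (start_sec, end_sec, confidence).
pool = [(0.0, 12.5, 0.40), (3.0, 30.0, 0.90), (28.0, 61.0, 0.75)]
best = top_proposals(pool, n=2)
print(best)
```

Because predictions from both streams land in one pool, a single confidence ranking decides which segments are clipped and sent to the captioning module, regardless of which modality produced them.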