A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer
Vladimir Iashin, Esa Rahtu
TL;DR
The paper tackles dense video captioning by using both audio and visual cues in a Bi-modal Transformer, which consists of a bi-modal encoder–decoder for captioning and a bi-modal multi-headed proposal generator for dense event proposals. It relies on pre-extracted I3D visual features, VGGish audio features, and GloVe-based word embeddings, and it introduces modality-specific attention mechanisms and segment-length priors derived via K-means. The approach achieves state-of-the-art results on ActivityNet Captions in the learned-proposals setting, and its ablations demonstrate the value of cross-modal attention, all while maintaining a unified training procedure. The work shows the practical benefit of fusing audio and visual information for dense video understanding and offers insights into design choices for multi-modal sequence-to-sequence tasks.
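To make the idea of segment-length priors derived via K-means concrete, here is a minimal sketch that clusters event durations into a few representative lengths. It is an illustration under stated assumptions, not the authors' implementation: the function name `kmeans_1d`, the deterministic initialization, the number of clusters, and the example durations are all hypothetical.

```python
def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's algorithm on scalars; returns the k centroids, sorted.

    Hypothetical helper for illustration: in this sketch the centroids play
    the role of segment-length priors for a proposal generator.
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    pts = sorted(values)
    # Deterministic init (an assumption): spread the k starting centroids
    # evenly over the sorted data instead of sampling them at random.
    if k == 1:
        centroids = [pts[len(pts) // 2]]
    else:
        centroids = [pts[round(i * (len(pts) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each duration joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in pts:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


# Made-up event durations in seconds, not taken from ActivityNet Captions.
durations = [4.0, 5.0, 6.0, 28.0, 30.0, 32.0, 118.0, 120.0, 122.0]
length_priors = kmeans_1d(durations, k=3)
print(length_priors)  # [5.0, 30.0, 120.0]
```

Each centroid can then be read as a typical event length around which a proposal head predicts refinements, in the same spirit as anchor-based detectors.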
Abstract
Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they show poor results or demonstrate the importance of audio only on a dataset from a specific domain. In this paper, we introduce the Bi-modal Transformer, which generalizes the Transformer architecture to a bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task. We also show that the pre-trained bi-modal encoder, as a part of the Bi-modal Transformer, can be used as a feature extractor for a simple proposal generation module. The performance is demonstrated on the challenging ActivityNet Captions dataset, where our model achieves outstanding performance. The code is available at v-iashin.github.io/bmt
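As a rough illustration of how one modality can attend to another, the sketch below applies single-head scaled dot-product attention with queries from one sequence and keys and values from the other. It is a simplified assumption rather than the paper's layer: there are no learned projections, multiple heads, residual connections, or normalization, and the name `cross_attention` and the toy feature vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention from one sequence onto another.

    queries: list of d-dimensional vectors from modality A
    keys, values: lists of vectors from modality B (same length as each other)
    Returns one attended vector per query, so the output keeps A's length.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every time step of the other modality.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the other modality's value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


# Toy features (made up): 3 audio steps and 2 visual steps, dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0]]

audio_attends_visual = cross_attention(audio, visual, visual)
visual_attends_audio = cross_attention(visual, audio, audio)
```

Note that the two sequences may have different lengths: `audio_attends_visual` has one vector per audio step and `visual_attends_audio` has one per visual step, so neither modality needs to be resampled to match the other.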
