Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems

Hung Le, Doyen Sahoo, Nancy F. Chen, Steven C. H. Hoi

TL;DR

This paper tackles video-grounded dialogue systems (VGDS) by introducing Multimodal Transformer Networks (MTN), which fuse text, visual, and audio features of a video through an encoder-decoder architecture augmented with a query-aware auto-encoder (QAE). A token-level decoding simulation during training brings training conditions closer to inference-time autoregressive generation, with the aim of improving response quality. MTN reports state-of-the-art results on the DSTC7 Video Scene-aware Dialogue task and generalizes to the visual-dialogue setting (VisDial), with ablations supporting the contributions of the QAE and cross-modal attention. The approach is an end-to-end framework for reasoning over long video sequences and multimodal inputs, and the PyTorch code is released for reproducibility and further research.

Abstract

Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on visual and audio aspects of a given video, is significantly more challenging than traditional image- or text-grounded dialogue systems because (1) the feature space of videos spans multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.) to obtain a comprehensive understanding. Most existing work is based on RNNs and sequence-to-sequence architectures, which are not very effective at capturing complex long-term dependencies (such as those in videos). To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text modalities. We develop a training procedure that simulates token-level decoding to improve the quality of generated responses during inference. We achieve state-of-the-art performance on the Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to another multimodal visual-grounded dialogue task and obtains promising performance. We implemented our models in PyTorch, and the code is released at https://github.com/henryhungle/MTN.
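
To make the idea of query-aware attention concrete, the following is a minimal sketch in plain Python (standard library only, not the released PyTorch code). It assumes the mechanism can be approximated by single-head scaled dot-product attention, with query token vectors attending over per-frame video features; the function names `softmax` and `query_aware_attention` are hypothetical and chosen for illustration. Learned projections, multiple heads, and the auto-encoding objective are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_aware_attention(query, video):
    """Each query token attends over all video frames (illustrative only).

    query: list of token vectors, each a list of d floats
    video: list of frame feature vectors, each a list of d floats
    Returns one d-dimensional attended vector per query token.
    """
    d = len(video[0])
    attended = []
    for q in query:
        # Similarity between this query token and every frame, scaled by sqrt(d).
        scores = [
            sum(qi * fi for qi, fi in zip(q, frame)) / math.sqrt(d)
            for frame in video
        ]
        weights = softmax(scores)
        # Weighted sum of frame features: the query-aware video summary.
        attended.append(
            [sum(w * frame[k] for w, frame in zip(weights, video)) for k in range(d)]
        )
    return attended


# Toy example: one query token that aligns with the first of two frames.
toy_query = [[1.0, 0.0]]
toy_video = [[10.0, 0.0], [0.0, 10.0]]
toy_output = query_aware_attention(toy_query, toy_video)
```

In the toy example the query token points in the same direction as the first frame, so the attended vector is dominated by that frame's features.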

Paper Structure

This paper contains 18 sections, 3 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: A sample dialogue from the DSTC7 Video Scene-aware Dialogue training set with 4 example video scenes. C: Video Caption, S: Video Summary, Qi: $i^{th}$-turn question, Ai: $i^{th}$-turn answer
  • Figure 2: Our MTN architecture includes 3 major components: (i) encoder layers encode text sequences and video features; (ii) decoder layers (D) project target sequence and attend on multiple inputs; and (iii) Query-Aware Auto-Encoder layers (QAE) attend on non-text modalities from query features. For simplicity, Feed Forward, Residual Connection and Layer Normalization layers are not presented. Best viewed in color.
  • Figure 3: 2 types of encoders are used: text-sequence encoders (left) and video encoders (right). Text-sequence encoders are used on text input, i.e. dialogue history, video caption, query, and output sequence. Video encoders are used on visual and audio features of input video.
  • Figure 4: Impact of simulation probability $p$ on the BLEU4 measure on the test data. For $p$ between $0.4$ and $0.6$, the improvement in BLEU4 scores is more significant.
  • Figure 5: Comparison of CIDEr measures on the test data between MTN (Base) and the baseline of Hori et al. (2018) across different turn positions of the generated responses. Our model outperforms the baseline at all dialogue turn positions.
  • ...and 1 more figures
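
The simulation probability $p$ in Figure 4 can be illustrated with a short sketch. The text above does not spell out the procedure, so this is an assumption: with probability $p$, the target response is cropped to a random left-to-right prefix, so that training sometimes sees partially generated sequences as it would during token-by-token decoding. The function name `simulate_token_level_decoding` and its signature are hypothetical.

```python
import random


def simulate_token_level_decoding(target, p, rng=random):
    """With probability p, crop the target to a random non-empty prefix.

    target: list of tokens for the full reference response
    p:      simulation probability in [0, 1]
    rng:    random source exposing random() and randint() (assumed interface)
    """
    if len(target) > 1 and rng.random() < p:
        cut = rng.randint(1, len(target))  # randint is inclusive on both ends
        return target[:cut]
    # Otherwise train on the full target sequence, as in standard training.
    return list(target)


# Usage: a seeded generator keeps the sketch reproducible.
seeded = random.Random(0)
tokens = ["a", "man", "is", "cooking", "in", "the", "kitchen"]
cropped = simulate_token_level_decoding(tokens, p=0.5, rng=seeded)
```

Setting `p=0` recovers ordinary training on full targets, while `p=1` applies the cropping to every example.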