Multimodal Transformer for Unaligned Multimodal Language Sequences

Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov

TL;DR

MulT reframes multimodal fusion as a Transformer-based crossmodal attention problem that operates on unaligned time-series from language, vision, and audio. It avoids explicit alignment by letting crossmodal attention dynamically adapt representations across modalities, which lets it capture long-range crossmodal dependencies. Across MOSI, MOSEI, and IEMOCAP, MulT achieves state-of-the-art results in both word-aligned and unaligned settings, and its ablations show the value of low-level crossmodal adaptation. This work highlights crossmodal attention as a powerful mechanism for multimodal fusion and suggests broader applications such as video question answering.

Abstract

Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, modeling such multimodal human language time-series data poses two major challenges: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that the proposed crossmodal attention mechanism in MulT is able to capture correlated crossmodal signals.
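
To make the idea of directional crossmodal attention concrete, here is a minimal single-head sketch in NumPy, assuming standard scaled dot-product attention in which queries come from the target modality and keys and values come from the source modality. The function name, dimensions, and random projection weights are illustrative assumptions, not the authors' implementation; the point is only that the two sequences may have different lengths and no alignment step is needed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def crossmodal_attention(x_alpha, x_beta, w_q, w_k, w_v):
    """Adapt source modality beta to target modality alpha (beta -> alpha).

    Hypothetical helper: queries come from alpha, keys/values from beta.
    """
    q = x_alpha @ w_q                          # (T_alpha, d_k)
    k = x_beta @ w_k                           # (T_beta, d_k)
    v = x_beta @ w_v                           # (T_beta, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T_alpha, T_beta)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights                # (T_alpha, d_v), (T_alpha, T_beta)

# Assumed toy setup: 5 word steps, 37 audio frames, unequal feature sizes.
rng = np.random.default_rng(0)
T_lang, T_audio = 5, 37
d_lang, d_audio, d_k, d_v = 8, 6, 4, 4

x_lang = rng.normal(size=(T_lang, d_lang))
x_audio = rng.normal(size=(T_audio, d_audio))
w_q = rng.normal(size=(d_lang, d_k))
w_k = rng.normal(size=(d_audio, d_k))
w_v = rng.normal(size=(d_audio, d_v))

adapted, weights = crossmodal_attention(x_lang, x_audio, w_q, w_k, w_v)
print(adapted.shape)   # (5, 4)
print(weights.shape)   # (5, 37)
```

Every language step receives a weighted summary of all audio frames, so the output has the length of the target sequence regardless of the source sampling rate. The attention is directional: swapping the roles of the two inputs gives the opposite adaptation, with an output of length 37.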

Paper Structure

This paper contains 27 sections, 5 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Example video clip from movie reviews. [Top]: Illustration of word-level alignment where video and audio features are averaged across the time interval of each spoken word. [Bottom]: Illustration of crossmodal attention weights between text ("spectacle") and vision/audio.
  • Figure 2: Overall architecture for MulT on modalities ($L, V, A$). The crossmodal transformers, which suggest latent crossmodal adaptations, are the core components of MulT for multimodal fusion.
  • Figure 3: Architectural elements of a crossmodal transformer between two time-series from modality $\alpha$ and $\beta$.
  • Figure 4: An example of visualizing alignment using the attention matrix from modality $\beta$ to $\alpha$. Multimodal alignment is a special (monotonic) case of crossmodal attention.
  • Figure 5: Validation set convergence of MulT when compared to other baselines on the unaligned CMU-MOSEI task.
  • ...and 1 more figure
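
The captions of Figures 1 and 4 describe word-level alignment as averaging frame features over each word's time interval, and call alignment a special monotonic case of crossmodal attention. The sketch below illustrates that reading under stated assumptions: the word spans, frame timestamps (in milliseconds), and feature values are invented for the example, and `word_alignment_matrix` is a hypothetical helper. It builds a fixed attention-like matrix whose rows are uniform over the frames inside each word, so that a single matrix product performs the averaging.

```python
import numpy as np

def word_alignment_matrix(word_spans, frame_times):
    """Row i is uniform over the frames whose timestamp lies in word i's [start, end)."""
    A = np.zeros((len(word_spans), len(frame_times)))
    for i, (start, end) in enumerate(word_spans):
        inside = (frame_times >= start) & (frame_times < end)
        if inside.any():
            A[i, inside] = 1.0 / inside.sum()
    return A

# Assumed toy data: 3 words, 10 frames sampled every 100 ms, 2 features per frame.
word_spans = [(0, 300), (300, 700), (700, 1000)]
frame_times = np.arange(10) * 100
frames = np.arange(20, dtype=float).reshape(10, 2)

A = word_alignment_matrix(word_spans, frame_times)
aligned = A @ frames        # per-word average of frame features
print(aligned)              # rows: [2, 3], [9, 10], [16, 17]

# The first attended frame never moves backwards from one word to the next,
# which is what makes this hard alignment monotonic.
first_frame = np.argmax(A > 0, axis=1)
print(first_frame)          # [0 3 7]
```

Each row of `A` is non-negative and sums to 1, just like a row of softmax attention weights, but its support is a fixed contiguous block. Learned crossmodal attention drops that constraint, so a word may draw on frames anywhere in the clip.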