StreaMulT: Streaming Multimodal Transformer for Heterogeneous and Arbitrary Long Sequential Data

Victor Pellegrain, Myriam Tami, Michel Batteux, Céline Hudelot

TL;DR

StreaMulT addresses predictive tasks over arbitrarily long, heterogeneous multimodal streams by introducing a Streaming Multimodal Transformer that fuses modalities via crossmodal attention while using a memory bank and block processing to enable streaming inference. The architecture balances space and time complexity, enabling training on long sequences and low-latency inference, and demonstrates strong performance on CMU-MOSEI, especially with contextual language embeddings. This work highlights the feasibility and practical impact of long-range multimodal learning for Industry 4.0 scenarios, while also underscoring the need for publicly available industrial datasets to benchmark such streaming multimodal models.

Abstract

The increasing complexity of Industry 4.0 systems brings new challenges for predictive maintenance tasks such as fault detection and diagnosis. A realistic corresponding setting includes multi-source data streams from different modalities, such as sensor measurement time series, machine images, and textual maintenance reports. These heterogeneous multimodal streams also differ in their acquisition frequency, may embed temporally unaligned information, and can be arbitrarily long, depending on the considered system and task. Whereas multimodal fusion has been widely studied in a static setting, to the best of our knowledge, no previous work considers arbitrarily long multimodal streams together with related tasks such as prediction across time. Thus, in this paper, we first formalize this new paradigm of heterogeneous multimodal learning in a streaming setting. To tackle this challenge, we propose StreaMulT, a Streaming Multimodal Transformer relying on cross-modal attention and on a memory bank to process arbitrarily long input sequences at training time and run in a streaming way at inference. StreaMulT improves the state-of-the-art metrics on the CMU-MOSEI dataset for the Multimodal Sentiment Analysis task, while being able to deal with much longer inputs than other multimodal models. Finally, the conducted experiments highlight the importance of the textual embedding layer, questioning recent improvements in Multimodal Sentiment Analysis benchmarks.

Paper Structure

This paper contains 13 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Multimodal learning in a streaming scheme applied to industrial monitoring
  • Figure 2: Streaming Multimodal Transformer architecture. SCT stands for Streaming Crossmodal Transformer. Different colors represent the heterogeneous nature of the different modalities, and shadings represent crossmodal features.
  • Figure 3: Block processing for multimodal learning in a streaming scheme. For modality $\alpha$: $X_\alpha$, $C_{\alpha,i}$, $L_{\alpha,i}$ and $R_{\alpha,i}$ respectively correspond to the full input sequence, the initial $i$-th block, and the left and right contexts associated with this block to form the contextual $i$-th segment. $s_{\alpha,i}$ corresponds to the mean of the current segment $C_{\alpha,i}$. The blue area represents an initial block for modality $\beta$, while the pink one represents a contextual segment for modality $\gamma$.
  • Figure 4: Streaming Crossmodal Transformer module
  • Figure 5: Flexible scheme. At training time (left), subsequences of $h$ consecutive segments are created to parallelize crossmodal attention operations. At inference (right), one can still process segments one by one to obtain a short-time response.
  • ...and 1 more figure
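
The block processing described in the Figure 3 caption can be illustrated with a minimal sketch for a single modality. It is not the authors' implementation: the names `Segment`, `mean_vector` and `make_segments`, the parameters `block_len`, `left_len` and `right_len`, and the choices of non-overlapping fixed-length blocks and of averaging over the block's time steps are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class Segment:
    """Hypothetical container for the contextual i-th segment of one modality."""
    left: List[Vector]   # L_{alpha,i}: left context of the block
    block: List[Vector]  # C_{alpha,i}: the i-th block itself
    right: List[Vector]  # R_{alpha,i}: right context of the block
    summary: Vector      # s_{alpha,i}: assumed here to be the mean over the block


def mean_vector(rows: Sequence[Vector]) -> Vector:
    """Component-wise mean over the time steps in `rows`."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def make_segments(x: Sequence[Vector], block_len: int,
                  left_len: int, right_len: int) -> List[Segment]:
    """Split the full sequence X_alpha into blocks with left and right contexts.

    Assumption: blocks are non-overlapping and of fixed length, except
    possibly the last one, which may be shorter.
    """
    if block_len <= 0 or left_len < 0 or right_len < 0:
        raise ValueError("block_len must be positive, context lengths non-negative")
    segments = []
    for start in range(0, len(x), block_len):
        end = min(start + block_len, len(x))
        block = list(x[start:end])
        # Contexts are truncated at the sequence boundaries.
        left = list(x[max(0, start - left_len):start])
        right = list(x[end:end + right_len])
        segments.append(Segment(left, block, right, mean_vector(block)))
    return segments


# Toy sequence of 10 time steps with 2 features each.
x = [[float(t), 2.0 * t] for t in range(10)]
segs = make_segments(x, block_len=4, left_len=2, right_len=1)
print(len(segs))        # 3
print(segs[1].summary)  # [5.5, 11.0]
```

In a streaming setting, segments of this kind would be produced one at a time as data arrive, which is why the sketch only ever looks a bounded number of steps to the left and right of each block.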