Multimodal Transformer With a Low-Computational-Cost Guarantee

Sungjin Park, Edward Choi

TL;DR

The paper investigates the high computational cost of attention in multimodal Transformers and introduces LoCoMT, a Low-Cost Multimodal Transformer that assigns per-head attention views to constrain cross-modal references. Theoretical analysis shows $C_{LoCoMT} \le C_{self} < C_{bottle} < C_{multi}$, with equality to self-attention only for $m=2$ and $L_1=L_2$; this yields lower cost while preserving multimodal fusion. Empirically, LoCoMT achieves competitive accuracy on Audioset and MedVidCL while reducing GFLOPs (e.g., by 6.2% on Audioset and 51.3% on MedVidCL), demonstrating practical efficiency gains for multimodal classification. The work highlights the importance of attention-view configuration, showing that mixing patterns across heads and layers can substantially improve the efficiency-accuracy trade-off. Random configurations sometimes perform well, and future work is aimed at automatic view-configuration search.

Abstract

Transformer-based models have significantly improved performance across a range of multimodal understanding tasks, such as visual question answering and action recognition. However, multimodal Transformers suffer from the quadratic complexity of multi-head attention in the input sequence length, a cost that grows as the number of modalities increases. To address this, we introduce Low-Cost Multimodal Transformer (LoCoMT), a novel multimodal attention mechanism that aims to reduce computational cost during training and inference with minimal performance loss. Specifically, by assigning different multimodal attention patterns to each attention head, LoCoMT can flexibly control multimodal signals and theoretically ensures a reduced computational cost compared to existing multimodal Transformer variants. Experimental results on two multimodal datasets, namely Audioset and MedVidCL, demonstrate that LoCoMT not only reduces GFLOPs but also matches or even outperforms established models.
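
The idea of giving each attention head its own multimodal attention pattern can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`modality_spans`, `view_mask`, `active_pairs`), the sequence lengths, the four-head assignment, and the use of active query-key pairs as a stand-in for attention cost are all assumptions made for illustration.

```python
from itertools import accumulate

def modality_spans(lengths):
    """Return (start, end) index ranges of each modality in the concatenated sequence."""
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))

def view_mask(lengths, targets):
    """Boolean mask: queries of modality i may attend only to keys of modality targets[i]."""
    total = sum(lengths)
    spans = modality_spans(lengths)
    mask = [[False] * total for _ in range(total)]
    for i, (q_start, q_end) in enumerate(spans):
        k_start, k_end = spans[targets[i]]
        for q in range(q_start, q_end):
            for k in range(k_start, k_end):
                mask[q][k] = True
    return mask

def active_pairs(mask):
    """Count unmasked query-key pairs (an assumed proxy for attention-score cost)."""
    return sum(sum(row) for row in mask)

lengths = [4, 6]  # hypothetical lengths L_1 and L_2 for two modalities
self_view = view_mask(lengths, targets=[0, 1])   # each modality attends to itself
cross_view = view_mask(lengths, targets=[1, 0])  # each modality attends to the other

# Hypothetical assignment of views to h = 4 attention heads.
head_views = [self_view, cross_view, cross_view, self_view]

per_head = [active_pairs(m) for m in head_views]
joint_pairs = sum(lengths) ** 2  # full attention over the concatenated sequence

print(per_head, joint_pairs)  # [52, 48, 48, 52] 100
```

In this toy setting every head scores fewer query-key pairs than full attention over the concatenated input. Note that masking a dense attention matrix does not by itself save computation; a saving of this kind is only realized if each active block is computed separately.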

Paper Structure (11 sections, 9 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Attention maps of a)-d) common multimodal Transformers and e) LoCoMT, when the input consists of two modalities with lengths $L_1$ and $L_2$ and the number of attention heads is $h$. Masked tokens are shown in gray, active tokens in green, and bottleneck tokens in purple. $p_i$ denotes the number of attention heads assigned to attention view $P_i$.
  • Figure 2: Two consecutive Low-Cost Multimodal Transformer (LoCoMT) layers. Our model can assign different attention patterns across the attention heads and layers, allowing for flexible control over multimodal signals.
  • Figure 3: a) The performance-efficiency trade-off. b) Effect of varying the number of fusion layers and the view frequency. c) Effect of varying the view frequency while keeping the number of fusion layers constant.