One-Versus-Others Attention: Scalable Multimodal Integration for Biomedical Data

Michal Golovanevsky, Eva Schiller, Akira Nair, Eric Han, Ritambhara Singh, Carsten Eickhoff

TL;DR

This work proposes a new attention mechanism, One-Versus-Others (OvO) attention, that scales linearly with the number of modalities, thus offering a significant reduction in computational complexity compared to existing multimodal attention methods.

Abstract

Multimodal learning models have become increasingly important as they surpass single-modality approaches on diverse tasks ranging from question-answering to autonomous driving. Despite the importance of multimodal learning, existing efforts focus on NLP applications, where the number of modalities is typically fewer than four (audio, video, text, images). However, data inputs in other domains, such as the medical field, may include X-rays, PET scans, MRIs, genetic screening, clinical notes, and more, creating a need for both efficient and accurate information fusion. Many state-of-the-art models rely on pairwise cross-modal attention, which does not scale well for applications with more than three modalities. For $n$ modalities, computing pairwise attention requires $\binom{n}{2}$ operations, potentially demanding considerable computational resources. To address this, we propose a new domain-neutral attention mechanism, One-Versus-Others (OvO) attention, that scales linearly with the number of modalities and requires only $n$ attention operations, thus offering a significant reduction in computational complexity compared to existing cross-modal attention algorithms. Using three diverse real-world datasets as well as an additional simulation experiment, we show that our method improves performance compared to popular fusion techniques while decreasing computation costs.
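To make the scaling claim concrete, the short sketch below tabulates the number of attention operations stated in the abstract: $\binom{n}{2}$ for pairwise cross-modal attention and $n$ for OvO. The function names are hypothetical and chosen only for illustration; the counts cover attention operations alone, not FLOPs or memory.

```python
from math import comb


def pairwise_attention_ops(n: int) -> int:
    """Attention operations for pairwise cross-modal attention: C(n, 2)."""
    return comb(n, 2)


def ovo_attention_ops(n: int) -> int:
    """Attention operations for OvO attention: one per modality."""
    return n


if __name__ == "__main__":
    # Compare the two counts as the number of modalities grows.
    print(f"{'n':>3} {'pairwise':>9} {'OvO':>4}")
    for n in (2, 3, 4, 5, 8, 10):
        print(f"{n:>3} {pairwise_attention_ops(n):>9} {ovo_attention_ops(n):>4}")
```

Under these counts the two approaches are tied at three modalities, and the gap widens from four onward, which is consistent with the abstract's remark that pairwise attention scales poorly beyond three modalities.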

Paper Structure

This paper contains 25 sections, 27 equations, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Integration scheme comparison. (a) Early fusion followed by self-attention, using scaled dot-product attention (Vaswani et al., 2017). (b) Pairwise cross-attention integration, also using scaled dot-product attention (Vaswani et al., 2017). (c) Our proposed method, One-Versus-Others (OvO), does not rely on pairwise interactions or long concatenated sequences but rather captures all modalities in a single attention score. A modality embedding is represented by $m_i$, and $W$ is a learnable parameter (see the section on OvO attention).
  • Figure 2: The impact of using OvO attention to fuse simulated data. Using FLOPs as a measure of compute, we demonstrate that OvO grows linearly with respect to the number of modalities, while self-attention and cross-attention grow quadratically.