MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

Thanh-Dat Truong, Christophe Bobda, Nitin Agarwal, Khoa Luu

TL;DR

MANGO replaces the implicit Transformer-based cross-attention used in most multimodal fusion models with explicit, invertible cross-attention inside a normalizing-flow framework. Its core contributions are the Invertible Cross-Attention (ICA) layer, three partitioning schemes (MMCA, IMCA, LICA) for capturing cross-modal correlations, and a latent-model extension that scales the approach to high-dimensional data. The authors report state-of-the-art performance on semantic segmentation, image-to-image translation, and movie genre classification across standard benchmarks, and position the method as an interpretable, tractable alternative to implicit fusion.

Abstract

Multimodal learning has achieved considerable success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to learn the underlying correlation of multimodal features only implicitly. As a result, the multimodal model cannot capture the essential features of each modality, which makes it difficult to comprehend the complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach for explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross-Attention (ICA) layer to build a normalizing flow-based model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data within the ICA layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow that scales the proposed method to high-dimensional multimodal data. Experimental results on three multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, demonstrate the state-of-the-art (SoTA) performance of the proposed approach.
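To make "invertible" and "tractable" concrete, the sketch below shows one generic way an attention-conditioned layer can be inverted exactly and have a closed-form log-determinant. It is not the paper's ICA layer, whose formulation is not given in this summary. It is a plain affine coupling step in which modality A is rescaled and shifted using a context obtained by scaled dot-product attention over modality B. The names `attend`, `forward`, `inverse`, and the `queries` parameter (standing in for learned query vectors) are all assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, tokens):
    # Scaled dot-product attention where `tokens` serve as both keys and values.
    d = len(tokens[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append([sum(wj * t[c] for wj, t in zip(w, tokens)) for c in range(d)])
    return out

def forward(x_a, x_b, queries):
    # Context depends on x_b only, so the map x_a -> y_a is elementwise affine.
    ctx = attend(queries, x_b)
    log_s = [[math.tanh(c) for c in row] for row in ctx]  # bounded log-scale
    y_a = [[x * math.exp(s) + c for x, s, c in zip(xr, sr, cr)]
           for xr, sr, cr in zip(x_a, log_s, ctx)]
    # The Jacobian w.r.t. x_a is diagonal, so log|det J| is the sum of log-scales.
    log_det = sum(sum(row) for row in log_s)
    return y_a, log_det

def inverse(y_a, x_b, queries):
    # x_b passes through unchanged, so the same context can be recomputed exactly.
    ctx = attend(queries, x_b)
    return [[(y - c) * math.exp(-math.tanh(c)) for y, c in zip(yr, cr)]
            for yr, cr in zip(y_a, ctx)]
```

Because the scale and shift are computed from modality B alone, inverting the layer only requires recomputing the same attention context, and the log-determinant needed for an exact likelihood is a simple sum. This is the property that makes flow-based fusion tractable in general; how MANGO achieves it with MMCA, IMCA, and LICA is described in the paper itself.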

Paper Structure

This paper contains 15 sections, 13 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Our Cross-Modality Fusion Approach Via Multimodal Normalizing Flows with Invertible Cross-Attention.
  • Figure 2: Our Proposed Multimodal Attention-based Normalizing Flows (MANGO) Approach to Fusion Learning.
  • Figure 3: Our Invertible Cross-Attention (ICA).
  • Figure 4: Our Proposed Partitioning Approaches: Modality-to-Modality Cross-Attention (Left). Inter-Modality Cross-Attention (Middle). Learnable Inter-Modality Cross-Attention (Right).
  • Figure 5: Qualitative Comparison on NYUDv2 Benchmark.
  • ...and 2 more figures