Interpretable Tensor Fusion

Saurabh Varshneya, Antoine Ledent, Philipp Liznerski, Andriy Balinskyy, Purvanshi Mehta, Waleed Mustafa, Marius Kloft

TL;DR

InTense proposes interpretable tensor fusion for multimodal learning by jointly learning representations and their fusion, capturing both linear and multiplicative modality interactions. Building on Multiple Neural Learning (MNL), it extends kernel-based ideas to deep networks and introduces normalization schemes (IterBN/vBN) to produce genuine relevance scores for modalities and their interactions, while mitigating higher-order interaction bias. Theoretical analysis supports disentanglement of interaction orders, and extensive experiments on synthetic and six real-world datasets show competitive accuracy alongside strong interpretability compared to state-of-the-art baselines. This work advances transparent multimodal AI with practical impact across sentiment analysis, humor/sarcasm detection, layout design, and digit recognition, enabling safer and more trustworthy deployment in safety-critical domains.

Abstract

Conventional machine learning methods are predominantly designed to predict outcomes based on a single data type. However, practical applications may encompass data of diverse types, such as text, images, and audio. We introduce interpretable tensor fusion (InTense), a multimodal learning method for training neural networks to simultaneously learn multimodal data representations and their interpretable fusion. InTense can separately capture both linear combinations and multiplicative interactions of diverse data types, thereby disentangling higher-order interactions from the individual effects of each modality. InTense provides interpretability out of the box by assigning relevance scores to modalities and their associations. The approach is theoretically grounded and yields meaningful relevance scores on multiple synthetic and real-world datasets. Experiments on six real-world datasets show that InTense outperforms existing state-of-the-art multimodal interpretable approaches in terms of accuracy and interpretability.
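As a rough illustration of what "linear combinations and multiplicative interactions" can mean at the feature level, the sketch below enumerates one block per modality and one block per modality combination, with interaction blocks formed as flattened tensor (outer) products. It is only a sketch under assumptions: the modality names `a`, `v`, `t` follow the Figure 5 caption, the feature values and the helper names `outer_flatten` and `fusion_blocks` are hypothetical, and the learned networks and the normalization schemes (IterBN/vBN) that InTense relies on are not modeled.

```python
from itertools import combinations


def outer_flatten(u, v):
    """Flattened outer product of two feature vectors (a multiplicative interaction)."""
    return [a * b for a in u for b in v]


def fusion_blocks(features):
    """Map each modality subset to a feature block.

    First-order blocks are the modality features themselves; higher-order
    blocks are flattened tensor products of the modalities in the subset.
    """
    names = sorted(features)
    blocks = {}
    for order in range(1, len(names) + 1):
        for combo in combinations(names, order):
            vec = list(features[combo[0]])
            for name in combo[1:]:
                vec = outer_flatten(vec, features[name])
            blocks[" x ".join(combo)] = vec
    return blocks


# Hypothetical per-modality features (audio, vision, text); values are made up.
features = {"a": [1.0, 2.0], "v": [3.0], "t": [0.5, -1.0]}
blocks = fusion_blocks(features)
for name, vec in blocks.items():
    print(name, vec)
```

Keeping every interaction order in its own block is what would let a separate relevance score be attached to each modality and each interaction, instead of mixing them in one concatenated vector.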

Paper Structure (81 sections, 5 theorems, 39 equations, 7 figures, 5 tables)

Key Result

Theorem 1

The optimization problem in equation eq:BASIC is equivalent to a problem in which the parameters $\beta$ no longer appear and in which $q=\frac{2p}{p+1}$ (and therefore $1\leq q\leq 2$). The corresponding values of the relevance scores $\beta$ can be recovered from the solution after the optimization.
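The relation $q=\frac{2p}{p+1}$ matches a standard identity from $\ell_p$-norm multiple kernel learning. The sketch below checks that identity numerically. It assumes, since eq:BASIC is not reproduced here, that the $\beta$-dependent regularizer has the form $\sum_m \|w_m\|^2/\beta_m$ with $\|\beta\|_p = 1$; under that assumption, eliminating $\beta$ yields the squared block $q$-norm $\left(\sum_m \|w_m\|^q\right)^{2/q}$. The per-block norms and the function names are hypothetical, and all norms are taken to be strictly positive.

```python
def recover_beta(norms, p):
    """Closed-form minimizer of sum(a_m**2 / beta_m) subject to ||beta||_p = 1."""
    denom = sum(a ** (2 * p / (p + 1)) for a in norms) ** (1 / p)
    return [a ** (2 / (p + 1)) / denom for a in norms]


def beta_regularizer(norms, beta):
    """Regularizer that still contains beta: sum_m a_m**2 / beta_m."""
    return sum(a ** 2 / b for a, b in zip(norms, beta))


def block_norm_sq(norms, q):
    """Squared block q-norm: (sum_m a_m**q) ** (2 / q)."""
    return sum(a ** q for a in norms) ** (2 / q)


norms = [0.5, 2.0, 1.0]  # hypothetical per-block weight norms ||w_m||
p = 2.0
q = 2 * p / (p + 1)

beta = recover_beta(norms, p)
lhs = beta_regularizer(norms, beta)  # regularizer evaluated at the optimal beta
rhs = block_norm_sq(norms, q)        # beta-free regularizer

print("beta:", beta)
print("with beta:", lhs, "without beta:", rhs)
```

In this form, blocks with larger weight norms receive larger scores, and the scores satisfy $\sum_m \beta_m^p = 1$ by construction.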

Figures (7)

  • Figure 1: Left: an excerpt of the MUStARD dataset on sarcasm detection, where the proposed InTense method sets a new state-of-the-art. (See Section 4 for details.) A linear combination of modalities fails here because the expressions of happiness and anxiety combine to something neutral rather than sarcasm. To detect sarcasm, the interactions among modalities are crucial. InTense captures these interactions and assigns them interpretable relevance scores, shown in the pie chart. Scores for individual modalities and their interactions are colored green and blue, respectively. InTense reveals that interactions are crucial for successful sarcasm detection.
  • Figure 2: An excerpt of three modalities of toydata, our self-curated binary classification dataset, where each sequence is made from the set of letters {A,C,G,T}. A positive class-sequence "TCG" and a negative class-sequence "AGC" are added according to the probability $p_m$.
  • Figure 3: The figure shows a high correlation between InTense's relevance scores and the accuracies of unimodal models on toydata. The modalities $M_2, M_4, M_7$ achieve high relevance scores and high accuracy as they contain class-specific information. The other modalities contain no class-specific information, which leads to a very low relevance score and an accuracy of around $50\%$ (equivalent to random guessing).
  • Figure 4: Illustration of the relevance scores calculated by the proposed InTense and the MultiRoute baseline when higher-order modality interactions are involved in the ground truth. MultiRoute leads to biased results (blue bars), where the relevance scores are concentrated toward higher-order interactions $M_{1\otimes2\otimes3}$. In contrast, InTense (orange bars) correctly assigns a high relevance score only to the interaction $M_{1\otimes2}$, which contains all class-specific signals.
  • Figure 5: Relevance scores from InTense for audio (a), vision (v), text (t), and all their possible interactions.
  • ...and 2 more figures
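
The toydata construction described in the Figure 2 caption can be sketched as follows. This is a minimal sketch under assumptions, not the authors' generator: the sequence length, the uniformly random background, the random insertion position, and the names `make_sequence` and `MOTIFS` are hypothetical; only the alphabet, the two class-sequences, and the role of $p_m$ come from the caption.

```python
import random

ALPHABET = "ACGT"
MOTIFS = {1: "TCG", 0: "AGC"}  # positive / negative class-sequences from the caption


def make_sequence(label, p_m, length=12, rng=random):
    """Draw a random sequence and, with probability p_m, plant the class motif.

    `length` and the uniform insertion position are illustrative assumptions.
    """
    seq = [rng.choice(ALPHABET) for _ in range(length)]
    if rng.random() < p_m:
        motif = MOTIFS[label]
        start = rng.randrange(length - len(motif) + 1)
        seq[start:start + len(motif)] = motif
    return "".join(seq)


rng = random.Random(0)
informative = make_sequence(label=1, p_m=1.0, rng=rng)    # motif always planted
uninformative = make_sequence(label=1, p_m=0.0, rng=rng)  # motif never planted
print(informative, uninformative)
```

With `p_m = 0` a modality carries no planted class signal, which corresponds to the near-chance unimodal accuracy described for the uninformative modalities in Figure 3; a motif can still occur by chance in the random background.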

Theorems & Definitions (10)

  • Theorem 1
  • Theorem 2
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • Proposition A.3
  • proof
  • proof: Proof of Theorem \ref{DasTheorem}
  • proof