Interpretable Tensor Fusion
Saurabh Varshneya, Antoine Ledent, Philipp Liznerski, Andriy Balinskyy, Purvanshi Mehta, Waleed Mustafa, Marius Kloft
TL;DR
InTense proposes interpretable tensor fusion for multimodal learning by jointly learning representations and their fusion, capturing both linear and multiplicative modality interactions. Building on Multiple Neural Learning (MNL), it extends kernel-based ideas to deep networks and introduces normalization schemes (IterBN/vBN) to produce genuine relevance scores for modalities and their interactions, while mitigating higher-order interaction bias. Theoretical analysis supports disentanglement of interaction orders, and extensive experiments on synthetic and six real-world datasets show competitive accuracy alongside strong interpretability compared to state-of-the-art baselines. This work advances transparent multimodal AI with practical impact across sentiment analysis, humor/sarcasm detection, layout design, and digit recognition, enabling safer and more trustworthy deployment in safety-critical domains.
Abstract
Conventional machine learning methods are predominantly designed to predict outcomes based on a single data type. However, practical applications may encompass data of diverse types, such as text, images, and audio. We introduce interpretable tensor fusion (InTense), a multimodal learning method for training neural networks to simultaneously learn multimodal data representations and their interpretable fusion. InTense can separately capture both linear combinations and multiplicative interactions of diverse data types, thereby disentangling higher-order interactions from the individual effects of each modality. InTense provides interpretability out of the box by assigning relevance scores to modalities and their associations. The approach is theoretically grounded and yields meaningful relevance scores on multiple synthetic and real-world datasets. Experiments on six real-world datasets show that InTense outperforms existing state-of-the-art multimodal interpretable approaches in terms of accuracy and interpretability.
