MultiFIX: An XAI-friendly feature inducing approach to building models from multimodal data
Mafalda Malafaia, Thalea Schlender, Peter A. N. Bosman, Tanja Alderliesten
TL;DR
MultiFIX tackles interpretable multimodal learning in healthcare by inducing per-modality features and replacing fusion with symbolic expressions, enabling clear attribution of contributions from images and tabular data. The architecture combines gradient-based feature extraction (Grad-CAM for images) with evolutionary symbolic modeling (GP-GOMEA) to yield end-to-end trainable pipelines that remain verifiable and explainable. Across synthetic tasks and a melanoma dataset, MultiFIX demonstrates fusion advantages when modalities jointly carry information, while revealing limitations of standard backpropagation when inter-modal dependencies are strong. The work highlights a path toward scalable, explainable multimodal models by integrating evolutionary optimization with deep learning, and suggests future work to advance optimization strategies beyond gradient descent for tightly coupled multimodal systems.
Abstract
In the health domain, decisions are often based on different data modalities. Thus, when creating prediction models, multimodal fusion approaches that can extract and combine relevant features from different data modalities can be highly beneficial. Furthermore, it is important to understand how each modality impacts the final prediction, especially in high-stakes domains, so that these models can be used in a trustworthy and responsible manner. We propose MultiFIX: a new interpretability-focused multimodal data fusion pipeline that explicitly induces separate features from different data types that can subsequently be combined to make a final prediction. An end-to-end deep learning architecture is used to train a predictive model and extract representative features of each modality. Each part of the model is then explained using explainable artificial intelligence techniques. Attention maps are used to highlight important regions in image inputs. Inherently interpretable symbolic expressions, learned with GP-GOMEA, are used to describe the contribution of tabular inputs. The fusion of the extracted features to predict the target label is also replaced by a symbolic expression, learned with GP-GOMEA. Results on synthetic problems demonstrate the strengths and limitations of MultiFIX. Lastly, we apply MultiFIX to a publicly available dataset for the detection of malignant skin lesions.
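To make the structure described above concrete, the following is a minimal sketch of the final, explained form of such a pipeline: one induced feature per modality, combined by a small symbolic fusion expression. It is an illustration under stated assumptions, not the MultiFIX implementation. The function names (`image_feature`, `tabular_feature`, `fuse`, `predict`), the thresholds, and the expressions themselves are hypothetical stand-ins; in the actual approach the image feature comes from a trained deep learning block and the symbolic expressions are learned with GP-GOMEA.

```python
# Illustrative sketch only: every name, threshold, and expression below is a
# hypothetical stand-in, not taken from the MultiFIX implementation.
from typing import Dict, Sequence

Image = Sequence[Sequence[float]]  # toy grayscale image as nested lists
Record = Sequence[float]           # toy tabular record


def image_feature(img: Image) -> float:
    """Stand-in for the image feature-inducing block.

    Returns 1.0 when the mean pixel intensity exceeds 0.5, else 0.0.
    """
    pixels = [p for row in img for p in row]
    return 1.0 if sum(pixels) / len(pixels) > 0.5 else 0.0


def tabular_feature(x: Record) -> float:
    """Stand-in for a symbolic expression over the tabular inputs."""
    return 1.0 if x[0] + 2.0 * x[1] > 1.0 else 0.0


def fuse(f_img: float, f_tab: float) -> float:
    """Stand-in for the symbolic fusion expression (here a logical AND)."""
    return f_img * f_tab


def predict(img: Image, x: Record) -> Dict[str, float]:
    """Return the prediction together with the per-modality features,
    so that the contribution of each modality can be inspected."""
    f_img = image_feature(img)
    f_tab = tabular_feature(x)
    return {"f_img": f_img, "f_tab": f_tab, "label": fuse(f_img, f_tab)}
```

Because the fusion step is an explicit expression over named features, a call such as `predict(img, x)` exposes both the intermediate features and the label, which is the property that allows each modality's contribution to be read off directly.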
