
MultiFIX: An XAI-friendly feature inducing approach to building models from multimodal data

Mafalda Malafaia, Thalea Schlender, Peter A. N. Bosman, Tanja Alderliesten

TL;DR

MultiFIX tackles interpretable multimodal learning in healthcare by inducing per-modality features and replacing fusion with symbolic expressions, enabling clear attribution of contributions from images and tabular data. The architecture pairs end-to-end deep feature extraction with explanation of each part: Grad-CAM for the induced image features, and evolutionary symbolic modeling (GP-GOMEA) for the tabular features and the fusion step, yielding pipelines that remain verifiable and explainable. Across synthetic tasks and a melanoma dataset, MultiFIX demonstrates fusion advantages when modalities jointly carry information, while revealing limitations of standard backpropagation when inter-modal dependencies are strong. The work highlights a path toward scalable, explainable multimodal models by integrating evolutionary optimization with deep learning, and suggests future work on optimization strategies beyond gradient descent for tightly coupled multimodal systems.

Abstract

In the health domain, decisions are often based on different data modalities. Thus, when creating prediction models, multimodal fusion approaches that can extract and combine relevant features from different data modalities can be highly beneficial. Furthermore, it is important to understand how each modality impacts the final prediction, especially in high-stakes domains, so that these models can be used in a trustworthy and responsible manner. We propose MultiFIX: a new interpretability-focused multimodal data fusion pipeline that explicitly induces separate features from different data types that can subsequently be combined to make a final prediction. An end-to-end deep learning architecture is used to train a predictive model and extract representative features of each modality. Each part of the model is then explained using explainable artificial intelligence techniques. Attention maps are used to highlight important regions in image inputs. Inherently interpretable symbolic expressions, learned with GP-GOMEA, are used to describe the contribution of tabular inputs. The fusion of the extracted features to predict the target label is also replaced by a symbolic expression, learned with GP-GOMEA. Results on synthetic problems demonstrate the strengths and limitations of MultiFIX. Lastly, we apply MultiFIX to a publicly available dataset for the detection of malignant skin lesions.
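
The structure described in the abstract can be pictured with a toy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `image_feature`, `tabular_feature`, `fusion_symbolic`, and `predict` are hypothetical, and the hand-written formulas merely stand in for the trained neural networks and for the kind of expression GP-GOMEA would search for. The point it shows is the bottleneck: each modality is reduced to a small induced feature, and the fusion step sees only those features, which is what makes it replaceable by a readable expression.

```python
from typing import Callable, Sequence

# Hypothetical stand-ins for the feature-inducing blocks.
# In MultiFIX these are trained neural networks; here they are fixed toy functions.

def image_feature(image: Sequence[Sequence[float]]) -> float:
    """Induce one feature I from an image (toy rule: mean pixel intensity)."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)

def tabular_feature(row: Sequence[float]) -> float:
    """Induce one feature T from tabular inputs (toy symbolic rule: x0 > x1)."""
    return 1.0 if row[0] - row[1] > 0.0 else 0.0

def fusion_symbolic(i: float, t: float) -> int:
    """Toy symbolic fusion: predict 1 only when both induced features are 'on'."""
    return int(i > 0.5 and t > 0.5)

def predict(image, row, fusion: Callable[[float, float], int] = fusion_symbolic) -> int:
    # The fusion block receives only the induced features, never the raw inputs,
    # so any function of (I, T) can be swapped in here.
    return fusion(image_feature(image), tabular_feature(row))

bright = [[0.9, 0.8], [0.7, 1.0]]
dark = [[0.1, 0.0], [0.2, 0.1]]
print(predict(bright, [2.0, 1.0]))
print(predict(dark, [2.0, 1.0]))
```

Because `predict` takes the fusion function as an argument, the same sketch covers both cases mentioned in the abstract: a learned fusion block during training and a symbolic replacement afterwards.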

Paper Structure

This paper contains 29 sections, 2 equations, 9 figures, and 13 tables.

Figures (9)

  • Figure 1: Overview of MultiFIX. The available data is given as input to the feature-inducing blocks (NNs in this paper), whose output is passed into a fusion block to obtain the final prediction. Representative features (I from image data and T from tabular data) are thereby learned simultaneously when training the entire architecture (top). After the Training Stage, in the Inference Stage, induced image features are explained through Grad-CAM, and symbolic expressions are obtained with GP-GOMEA for both the tabular features and the fusion block. The GP-GOMEA models can also replace their NN counterparts, which makes them an integral part of the final model rather than only explanations and increases its verifiability potential. In this figure, the Multiclass Problem is used to illustrate MultiFIX.
  • Figure 2: Representative samples of input images for the Multiclass Problem. The first row displays images belonging to the star class (label $0$), and the second row displays images from the square class (label $1$). Each column contains images with a different resolution, starting with $100\times100$ pixels, followed by $50\times50$, $25\times25$, $20\times20$, $15\times15$, $10\times10$ and, lastly, $5\times5$ pixels.
  • Figure 3: Performance result matrix for the Multiclass Problem using Balanced Accuracy (BAcc). Rows represent imaging inputs with different resolutions. Columns represent tabular inputs with different standard deviations used in the added Gaussian noise. The results consist of the average BAcc and its standard deviation over 5-fold cross validation.
  • Figure 4: Interpretability example for the Multiclass Problem with image resolution $100\times100$ pixels and a tabular input with no Gaussian noise (standard deviation $0$). The new truth table for the Multiclass Problem is presented on the right, whilst the designed truth table is presented on the left.
  • Figure 5: Representative samples of input images for the Multifeature Problem.
  • ...and 4 more figures
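
The Figure 3 caption reports Balanced Accuracy (BAcc) as an average and standard deviation over 5-fold cross-validation. As a minimal sketch of how such a number can be computed, the code below uses the standard definition of balanced accuracy (the mean of per-class recall). The helper names `balanced_accuracy` and `summarize_folds` are hypothetical, the fold scores are made-up illustration values rather than results from the paper, and the use of the population standard deviation is an assumption, since the caption does not say which variant was used.

```python
from statistics import mean, pstdev

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [k for k, y in enumerate(y_true) if y == cls]
        correct = sum(1 for k in idx if y_pred[k] == cls)
        recalls.append(correct / len(idx))
    return mean(recalls)

def summarize_folds(fold_scores):
    """Average and (population) standard deviation of a metric across folds."""
    return mean(fold_scores), pstdev(fold_scores)

# Imbalanced toy labels: three samples of class 0, one of class 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
bacc = balanced_accuracy(y_true, y_pred)  # (2/3 + 1/1) / 2

# Illustrative fold scores only; these are not values from the paper.
avg, std = summarize_folds([0.8, 0.9, 1.0])
print(round(bacc, 4), round(avg, 4), round(std, 4))
```

On the toy labels, plain accuracy would be 3/4, while balanced accuracy weights both classes equally, which is why it is a common choice when class sizes differ.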