MixMAS: A Framework for Sampling-Based Mixer Architecture Search for Multimodal Fusion and Learning
Abdelmadjid Chergui, Grigor Bezirganyan, Sana Sellami, Laure Berti-Équille, Sébastien Fournier
TL;DR
The paper addresses the challenge of automatically selecting effective mixer-based architectures for multimodal data fusion. It introduces MixMAS, a sampling-based framework that decomposes architecture search into four stages: sampling, encoder selection, fusion function selection, and fusion network selection. Candidates at each stage are evaluated via micro-benchmarks on a representative subset of the data. In experiments on MM-IMDB, AV-MNIST, and MIMIC-III, the architectures selected by MixMAS outperform the baseline M2-Mixer as full models on MM-IMDB and AV-MNIST, in some cases with fewer parameters, and perform comparably to the baseline on MIMIC-III. The results also indicate that the best-performing modules depend on the modality. An open-source implementation is provided to enable broader adoption and extension to more modalities and components.
Abstract
Choosing a suitable deep learning architecture for multimodal data fusion is a challenging task, as it requires the effective integration and processing of diverse data types, each with distinct structures and characteristics. In this paper, we introduce MixMAS, a novel framework for sampling-based mixer architecture search tailored to multimodal learning. Our approach automatically selects the optimal MLP-based architecture for a given multimodal machine learning (MML) task. Specifically, MixMAS utilizes a sampling-based micro-benchmarking strategy to explore various combinations of modality-specific encoders, fusion functions, and fusion networks, systematically identifying the architecture that best meets the task's performance metrics.
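The staged selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The greedy stage-by-stage control flow, the uniform random sampling, the function names (`sample_subset`, `pick_best`, `staged_search`), the configuration keys, and the user-supplied `evaluate` callback (standing in for a micro-benchmark that briefly trains a candidate and returns a score) are all assumptions made for illustration.

```python
import random
from typing import Any, Callable, Dict, List, Sequence

Config = Dict[str, Any]
# Hypothetical micro-benchmark signature: briefly trains the (possibly partial)
# configuration on the subset and returns a score where higher is better.
Evaluate = Callable[[Config, Sequence[Any]], float]


def sample_subset(dataset: Sequence[Any], fraction: float, seed: int = 0) -> List[Any]:
    """Stage 1: draw a subset (uniform random sampling is an assumption)."""
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(list(dataset), k)


def pick_best(base: Config, key: str, options: Sequence[str],
              subset: Sequence[Any], evaluate: Evaluate) -> str:
    """Micro-benchmark every option for one slot and return the top scorer."""
    scores = {opt: evaluate({**base, key: opt}, subset) for opt in options}
    return max(scores, key=scores.get)


def staged_search(dataset: Sequence[Any],
                  encoders_per_modality: Dict[str, Sequence[str]],
                  fusion_functions: Sequence[str],
                  fusion_networks: Sequence[str],
                  evaluate: Evaluate,
                  fraction: float = 0.1,
                  seed: int = 0) -> Config:
    """Greedy four-stage search; earlier choices are fixed in later stages."""
    subset = sample_subset(dataset, fraction, seed)
    config: Config = {}
    # Stage 2: choose one encoder per modality.
    for modality, options in encoders_per_modality.items():
        key = f"encoder/{modality}"
        config[key] = pick_best(config, key, options, subset, evaluate)
    # Stage 3: choose the fusion function with the encoders fixed.
    config["fusion_function"] = pick_best(
        config, "fusion_function", fusion_functions, subset, evaluate)
    # Stage 4: choose the fusion network with everything else fixed.
    config["fusion_network"] = pick_best(
        config, "fusion_network", fusion_networks, subset, evaluate)
    return config
```

Under these assumptions, the number of micro-benchmarks grows with the sum of the candidate counts per stage rather than their product, which is the usual motivation for decomposing a search this way. In practice, `evaluate` would wrap a short training run on the sampled subset and return the task's validation metric.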
