MixMAS: A Framework for Sampling-Based Mixer Architecture Search for Multimodal Fusion and Learning

Abdelmadjid Chergui; Grigor Bezirganyan; Sana Sellami; Laure Berti-Équille; Sébastien Fournier

TL;DR

The paper addresses the challenge of automatically selecting effective mixer-based architectures for multimodal data fusion. It introduces MixMAS, a sampling-based framework that decomposes architecture search into four stages (sampling, encoder selection, fusion function selection, and fusion network selection), with candidates evaluated via micro-benchmarks on a representative subset of the data. In experiments on MM-IMDB, AV-MNIST, and MIMIC-III, MixMAS yields better full-model performance than the baseline M2-Mixer on MM-IMDB and AV-MNIST, in some cases with fewer parameters, and performs comparably to the baseline on MIMIC-III. The results indicate that the best-performing module depends on the modality. The authors provide an open-source implementation to enable broader adoption and extension to more modalities and components.

Abstract

Choosing a suitable deep learning architecture for multimodal data fusion is a challenging task, as it requires the effective integration and processing of diverse data types, each with distinct structures and characteristics. In this paper, we introduce MixMAS, a novel framework for sampling-based mixer architecture search tailored to multimodal learning. Our approach automatically selects the optimal MLP-based architecture for a given multimodal machine learning (MML) task. Specifically, MixMAS utilizes a sampling-based micro-benchmarking strategy to explore various combinations of modality-specific encoders, fusion functions, and fusion networks, systematically identifying the architecture that best meets the task's performance metrics.
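The staged search described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the greedy stage-by-stage selection, the function names (`sample_subset`, `staged_search`, `toy_score`), the candidate names (`enc_a`, `concat`, `net_small`, and so on), and the lookup-table scores are all assumptions made for illustration. In a real setting, the scoring function would train and evaluate each candidate briefly on the sampled subset.

```python
import random

def sample_subset(dataset, fraction, seed=0):
    """Stage 1: draw a random subset reused by every micro-benchmark."""
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(dataset, k)

def staged_search(subset, modalities, encoders, fusion_fns, fusion_nets, score):
    """Greedy selection, one stage at a time.

    `score(config, subset)` is a hypothetical micro-benchmark that returns
    a higher-is-better number for a candidate configuration.
    """
    # Stage 2: choose the best encoder for each modality independently.
    chosen = {}
    for m in modalities:
        chosen[m] = max(
            encoders,
            key=lambda e: score({"modality": m, "encoder": e}, subset),
        )
    # Stage 3: choose the fusion function with the encoders fixed.
    fusion = max(
        fusion_fns,
        key=lambda f: score({"encoders": chosen, "fusion": f}, subset),
    )
    # Stage 4: choose the fusion network with encoders and fusion fixed.
    net = max(
        fusion_nets,
        key=lambda n: score(
            {"encoders": chosen, "fusion": fusion, "net": n}, subset
        ),
    )
    return {"encoders": chosen, "fusion": fusion, "net": net}

# Made-up scores standing in for real micro-benchmark results.
TOY_SCORES = {
    ("image", "enc_a"): 0.6, ("image", "enc_b"): 0.7,
    ("text", "enc_a"): 0.8, ("text", "enc_b"): 0.5,
    "concat": 0.65, "sum": 0.60,
    "net_small": 0.66, "net_large": 0.64,
}

def toy_score(config, subset):
    """Look up a fixed score; a real benchmark would train on `subset`."""
    if "net" in config:
        return TOY_SCORES[config["net"]]
    if "fusion" in config:
        return TOY_SCORES[config["fusion"]]
    return TOY_SCORES[(config["modality"], config["encoder"])]

subset = sample_subset(list(range(40)), fraction=0.25)
result = staged_search(
    subset,
    modalities=["image", "text"],
    encoders=["enc_a", "enc_b"],
    fusion_fns=["concat", "sum"],
    fusion_nets=["net_small", "net_large"],
    score=toy_score,
)
print(result)
```

With these toy scores, the search picks a different encoder for each modality, which mirrors the kind of per-modality dependence a search of this shape can surface. A greedy staged search evaluates far fewer configurations than an exhaustive one, at the cost of ignoring interactions between stages.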
