MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains

Leyan Xue, Changqing Zhang, Kecheng Xue, Xiaohong Liu, Guangyu Wang, Zongbo Han

TL;DR

MULTIBENCH++ addresses the lack of universal, scalable benchmarks for multimodal fusion by introducing a large-scale, domain-adaptive benchmark that aggregates 30+ datasets across 15 modalities and 20 tasks, complemented by an open-source evaluation pipeline and automated hyperparameter tuning. The framework supports multiple Transformer-based fusion architectures and logit-level fusion methods, enabling fair, reproducible comparisons and robust testing of modern fusion models. Empirical results show that advanced fusion approaches like CACF generalize across diverse domains, while the benefit of model complexity depends on data complexity, underscoring the need for dataset-informed model selection. By providing a rigorous, reproducible testing ground, MULTIBENCH++ aims to accelerate the development of general-purpose multimodal architectures with broad real-world impact in areas such as remote sensing, healthcare, and affective computing.

Abstract

Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.

Paper Structure

This paper contains 172 sections, 11 equations, 1 figure, 10 tables.

Figures (1)

  • Figure 1: An overview of the MULTIBENCH++ framework, highlighting our core contributions. (Left) We introduce a broader and deeper collection of datasets, significantly expanding into more specialized domains and data modalities. (Center) We integrate more advanced fusion paradigms, including feature-level Transformer-based fusion and decision-level fusion. (Right) We provide an automated hyperparameter tuning platform, powered by Optuna, to ensure robust and reproducible evaluation.
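
To make the decision-level (logit-level) fusion named in the Figure 1 caption concrete, here is a minimal sketch in plain Python. It assumes one common form of the idea: each modality has its own classifier, and their logits are combined by a weighted average before the softmax. The function names `fuse_logits` and `softmax`, the two example modalities, and the logit values are all hypothetical illustrations. This is not the benchmark's implementation, and the paper's fusion methods may combine logits differently.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(per_modality_logits, weights=None):
    """Weighted average of per-modality logits (hypothetical helper).

    per_modality_logits: one list of class logits per modality.
    weights: optional per-modality weights; defaults to a uniform average.
    """
    n = len(per_modality_logits)
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("need exactly one weight per modality")

    num_classes = len(per_modality_logits[0])
    fused = [0.0] * num_classes
    for w, logits in zip(weights, per_modality_logits):
        if len(logits) != num_classes:
            raise ValueError("all modalities must score the same classes")
        for c, value in enumerate(logits):
            fused[c] += w * value
    return fused


# Made-up logits for a 3-class problem; each unimodal model disagrees.
image_logits = [2.0, 0.5, -1.0]   # image model alone prefers class 0
text_logits = [0.0, 2.5, 0.5]     # text model alone prefers class 1

fused = fuse_logits([image_logits, text_logits])
probs = softmax(fused)
predicted = max(range(len(probs)), key=probs.__getitem__)

print(fused)      # [1.0, 1.5, -0.25]
print(predicted)  # 1
```

In this toy case the text model's stronger evidence for class 1 outweighs the image model's preference for class 0. In a tuning setup such as the Optuna-based one the caption describes, the per-modality weights would be a natural candidate for a searched hyperparameter, though whether the benchmark tunes them is not stated here.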