Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data

Zhiqiang Tang, Zihan Zhong, Tong He, Gerald Friedland

TL;DR

This work addresses the gap in AutoML for multimodal data by evaluating a set of practical tricks across image, text, and tabular modalities. It introduces a benchmark of 22 datasets that covers all four modality combinations and both classification and regression tasks, enabling systematic assessment of fusion, augmentation, tabular-to-text conversion, cross-modal alignment, and missing-modality handling. A learnable ensemble (ensemble selection) integrates the tricks into a single robust pipeline. Late fusion and cross-modal alignment are identified as particularly impactful, and converting categorical tabular features to text often yields gains. The study offers a pragmatic recipe for multimodal AutoML and guidance on which tricks generalize across datasets, for both researchers and practitioners working on real-world multimodal problems.

Abstract

This paper studies the best practices for automatic machine learning (AutoML). While previous AutoML efforts have predominantly focused on unimodal data, the multimodal aspect remains under-explored. Our study delves into classification and regression problems involving flexible combinations of image, text, and tabular data. We curate a benchmark comprising 22 multimodal datasets from diverse real-world applications, encompassing all 4 combinations of the 3 modalities. Across this benchmark, we scrutinize design choices related to multimodal fusion strategies, multimodal data augmentation, converting tabular data into text, cross-modal alignment, and handling missing modalities. Through extensive experimentation and analysis, we distill a collection of effective strategies and consolidate them into a unified pipeline, achieving robust performance on diverse datasets.
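
To make "converting tabular data into text" concrete, here is a minimal sketch of one way to serialize the categorical fields of a row into a string that a text encoder could consume. The function name `categorical_to_text`, the `column: value` template, and the separator are assumptions for illustration; the abstract does not specify the paper's actual template.

```python
def categorical_to_text(row, categorical_cols, sep=" | "):
    """Serialize the categorical fields of one row into a single string.

    Hypothetical helper: the "column: value" template is an assumed format,
    not the one used in the paper.
    """
    parts = []
    for col in categorical_cols:
        value = row.get(col)
        if value is None:
            continue  # leave missing fields out of the text
        parts.append(f"{col}: {value}")
    return sep.join(parts)


# Toy row: numeric fields such as "price" stay in the tabular branch.
row = {"brand": "Acme", "color": "red", "price": 19.99, "condition": None}
text = categorical_to_text(row, ["brand", "color", "condition"])
print(text)
```

In this sketch the resulting string would be passed to the text model alongside any native text fields, while numeric columns remain tabular inputs.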

Paper Structure

This paper contains 27 sections, 1 figure, 25 tables.

Figures (1)

  • Figure 1: Weight statistics of tricks in Ensemble Selection. Each bar represents the average weight of a trick for a specific modality combination or across the entire benchmark (All). The red dashed lines indicate the average levels; bars above these lines denote tricks with above-average importance.
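
The per-trick weights summarized in Figure 1 can be understood through a small sketch of greedy forward ensemble selection with replacement, the common formulation of the technique. Whether the paper uses exactly this variant is an assumption, as are the mean-squared-error validation metric, the function name `ensemble_selection`, and the toy candidate names below.

```python
import numpy as np


def ensemble_selection(val_preds, y_val, n_rounds=10):
    """Greedy forward selection with replacement (illustrative sketch).

    val_preds: dict mapping a candidate name to its validation predictions.
    y_val: validation targets.
    Returns a dict mapping each name to its weight, i.e. the fraction of
    rounds in which that candidate was added to the ensemble.
    """
    y_val = np.asarray(y_val, dtype=float)
    names = list(val_preds)
    counts = {name: 0 for name in names}
    running_sum = np.zeros_like(y_val)

    for step in range(1, n_rounds + 1):
        best_name, best_loss = None, np.inf
        for name in names:
            # Ensemble prediction if this candidate were added once more.
            candidate = (running_sum + np.asarray(val_preds[name], dtype=float)) / step
            loss = np.mean((candidate - y_val) ** 2)  # assumed metric: MSE
            if loss < best_loss:
                best_name, best_loss = name, loss
        counts[best_name] += 1
        running_sum += np.asarray(val_preds[best_name], dtype=float)

    return {name: counts[name] / n_rounds for name in names}


# Toy example: two candidates with opposite biases and one poor candidate.
y = np.array([0.0, 1.0, 2.0, 3.0])
preds = {"trick_a": y + 1, "trick_b": y - 1, "trick_c": y + 5}
weights = ensemble_selection(preds, y, n_rounds=10)
print(weights)
```

In this toy run the two candidates with opposite biases cancel out and each receives weight 0.5, while the poor candidate receives weight 0. Averaging such weights over datasets would give per-trick bars of the kind the caption describes.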