Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities
Sai Munikoti, Ian Stewart, Sameera Horawalavithana, Henry Kvinge, Tegan Emerson, Sandra E Thompson, Karl Pazdernik
TL;DR
This survey analyzes generalist multimodal models (GMMs) that operate across multiple modalities beyond text and image. It introduces a three-axis taxonomy (Unifiability, Modularity, Adaptability) to classify architectures and training strategies, and develops a pipeline framework (pre-processing, universal learning, decoding) that unifies modalities around a shared backbone, typically an LLM. The paper surveys unimodal foundation models (language, vision, time series, graphs) and situates GMMs among recent architectures (e.g., Uni-Perceiver, Meta-Transformer, mPLUG-2, NExT-GPT), while discussing design choices such as homogenized encodings, modular adapters, and retrieval-augmented generation. Key challenges identified include data scarcity for new modalities, weak benchmarks, a lack of theory and trust frameworks, high compute costs, and misalignment between modality encodings. It concludes with future directions spanning modality expansion, multimodal prompting, scalable modular architectures, human-model interaction, and the emergence of advanced capabilities, with the aim of accelerating progress toward generalist multimodal intelligence.
Abstract
Multimodal models are expected to be a critical component of future advances in artificial intelligence. The field is growing rapidly, with a surge of new design elements motivated by the success of foundation models in natural language processing (NLP) and vision. It is widely hoped that extending foundation models to multiple modalities (e.g., text, image, video, sensor, time series, graph) will ultimately lead to generalist multimodal models, i.e., a single model that works across different data modalities and tasks. However, little research systematically analyzes recent multimodal models (particularly those that work beyond text and vision) with respect to their underlying architectures. This work therefore provides a fresh perspective on generalist multimodal models (GMMs) through a novel taxonomy specific to architecture and training configuration. The taxonomy covers factors such as Unifiability, Modularity, and Adaptability, which are pertinent and essential to the wide adoption and application of GMMs. The review further highlights key challenges and prospects for the field and guides researchers toward new advancements.
