From Specific-MLLMs to Omni-MLLMs: A Survey on MLLMs Aligned with Multi-modalities
Shixin Jiang, Jiafeng Liang, Jiyuan Wang, Xuan Dong, Heng Chang, Weijiang Yu, Jinhua Du, Ming Liu, Bing Qin
TL;DR
Omni-MLLMs aim to unify multi-modal understanding and generation beyond modality-specific models by mapping diverse non-linguistic inputs into the embedding space of LLMs. The paper presents a four-component architecture (encoding, alignment, interaction, generation) and a taxonomy that distinguishes continuous, discrete, and hybrid encoding, as well as projection and embedding alignment strategies. It outlines a two-stage training regime (alignment pre-training followed by instruction fine-tuning) and discusses data construction, benchmark suites, and evaluation across uni-modal and cross-modal tasks. Key contributions include the first comprehensive survey of Omni-MLLMs, a detailed taxonomy, and a synthesis of training recipes, datasets, and challenges, with future directions covering modality expansion, cross-modal capabilities, and real-world applications.
Abstract
To tackle complex tasks in real-world scenarios, a growing number of researchers are focusing on Omni-MLLMs, which aim to achieve omni-modal understanding and generation. Rather than being limited to any specific non-linguistic modality, Omni-MLLMs map various non-linguistic modalities into the embedding space of LLMs and enable interaction among, and understanding of, arbitrary combinations of modalities within a single model. In this paper, we systematically investigate relevant research and provide a comprehensive survey of Omni-MLLMs. Specifically, we first explain the four core components of Omni-MLLMs for unified multi-modal modeling, organized under a meticulous taxonomy that offers novel perspectives. Then, we describe how two-stage training achieves effective integration and discuss the corresponding datasets and evaluation. Furthermore, we summarize the main challenges of current Omni-MLLMs and outline future directions. We hope this paper serves as an introduction for beginners and promotes the advancement of related research. Resources have been made publicly available at https://github.com/threegold116/Awesome-Omni-MLLMs.
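As a rough illustration of what mapping non-linguistic modalities into the embedding space of an LLM can look like, the sketch below projects per-modality encoder features to a shared LLM width and concatenates them with text token embeddings. It is a minimal toy under stated assumptions, not the method of any surveyed model: the function names (`make_projection`, `project`, `build_llm_input`), the dimensions, and the random weights standing in for a learned projector are all hypothetical.

```python
import random

def make_projection(d_in, d_out, seed=0):
    """Random d_in x d_out matrix; a stand-in (assumption) for a learned projector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]

def project(features, weights):
    """Linearly map each encoder feature vector (length d_in) to the LLM width (d_out)."""
    d_out = len(weights[0])
    return [
        [sum(x * weights[i][j] for i, x in enumerate(vec)) for j in range(d_out)]
        for vec in features
    ]

def build_llm_input(text_embeds, modality_feats, projectors):
    """Place projected modality tokens before the text tokens in one sequence."""
    sequence = []
    for name, feats in modality_feats.items():
        sequence.extend(project(feats, projectors[name]))
    sequence.extend(text_embeds)
    return sequence

D_LLM = 8  # hypothetical LLM embedding width

# One projector per modality, since each encoder may have its own feature width.
projectors = {
    "image": make_projection(16, D_LLM, seed=1),
    "audio": make_projection(12, D_LLM, seed=2),
}

# Dummy encoder outputs: 4 image tokens of width 16, 3 audio tokens of width 12.
modality_feats = {
    "image": [[0.1] * 16 for _ in range(4)],
    "audio": [[0.2] * 12 for _ in range(3)],
}
text_embeds = [[0.0] * D_LLM for _ in range(5)]  # 5 text token embeddings

sequence = build_llm_input(text_embeds, modality_feats, projectors)
print(len(sequence), len(sequence[0]))  # 12 tokens, each of width 8
```

Because every modality ends up with the same width as the text embeddings, any combination of modalities can be packed into a single input sequence, which is the property the abstract attributes to Omni-MLLMs.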
