OneLLM: One Framework to Align All Modalities with Language

Jiaming Han; Kaixiong Gong; Yiyuan Zhang; Jiaqi Wang; Kaipeng Zhang; Dahua Lin; Yu Qiao; Peng Gao; Xiangyu Yue

TL;DR

OneLLM is a unified multimodal LLM framework that aligns eight modalities to language with a single universal encoder and a mixture-of-experts universal projection module (UPM). Training proceeds in two phases: a progressive multimodal alignment pipeline, then unified instruction tuning on a 2M-item multimodal instruction dataset so the model can follow instructions across modalities. Evaluated on 25 benchmarks, the approach performs strongly, often surpassing modality-specific models and other MLLMs, and it remains extensible to additional modalities. The work shows that a single encoder-projection-LLM stack is a feasible and scalable basis for broad multimodal alignment and reasoning.

Abstract

Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with the LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to the LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM
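
The abstract describes the UPM as a mix of several projection modules combined by dynamic routing. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a softmax-weighted (soft) router and reduces each projection expert to a single linear map; the class name `SoftRoutedProjection`, the helper functions, and all dimensions are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a dense layer; weights has shape (out_dim x in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def make_matrix(rows, cols, rng):
    """Small random matrix used as stand-in learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SoftRoutedProjection:
    """Hypothetical stand-in for a universal projection module."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Assumption: each expert is one linear map, not a full projection module.
        self.experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]
        # The router scores every expert for a given token.
        self.router = make_matrix(num_experts, dim, rng)

    def route(self, token):
        """Return one routing weight per expert; the weights sum to 1."""
        return softmax(linear(self.router, token))

    def __call__(self, token):
        gates = self.route(token)
        outputs = [linear(expert, token) for expert in self.experts]
        # Weighted sum of the expert outputs, element by element.
        return [
            sum(g * out[i] for g, out in zip(gates, outputs))
            for i in range(len(outputs[0]))
        ]


upm = SoftRoutedProjection(dim=4, num_experts=3)
token = [0.5, -1.0, 0.25, 2.0]
gates = upm.route(token)
projected = upm(token)
```

With a single expert the router weight is always 1, so the module reduces to that expert's projection. This matches the progression in the abstract, where one image projection module is trained first and the mixture is built from several such modules afterwards.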

Paper Structure (33 sections, 6 equations, 6 figures, 12 tables)

Introduction
Related Work
Method
Model Architecture
Progressive Multimodal Alignment
Unified Multimodal Instruction Tuning
Experiment
Implementation Details
Quantitative Evaluation
Ablation Experiments
Qualitative Analysis
Conclusion
Appendix Overview
Additional Ablation Experiments
Frozen vs. Trainable Encoder.
...and 18 more sections

Figures (6)

Figure 1: Comparisons of Different Multimodal LLMs. Vision LLM: one image encoder and projection module. Multimodal (MM) LLM: modality-specific encoder and projection module. OneLLM: a universal encoder, a universal projection module and modality tokens $\{\mathrm{modal}\}$ to switch between modalities. Bottom: OneLLM expands supported modalities from three to eight.
Figure 2: The Architecture of OneLLM. OneLLM consists of modality tokenizers, a universal encoder, a universal projection module (UPM) and an LLM. Each modality tokenizer is a 2D/1D convolution layer that transforms the input signal into a sequence of tokens. For simplicity, we omit the video and depth/normal map tokenizers. The universal encoder is a frozen vision-language model (i.e., CLIP; Radford et al., 2021) that extracts high-dimensional features. The UPM is composed of several projection experts and modality routers that align the input signal with language. In the alignment stage, we train the modality tokenizers and the UPM and keep the LLM frozen. In the instruction tuning stage, we train only the LLM and keep the other modules frozen. In a forward pass of the UPM, we concatenate the input tokens and the modality tokens as input. We then take only the modality tokens as a summary of the input signal and feed them into the LLM for multimodal understanding.
Figure 3: Qualitative Results on Eight Modalities. All demo inputs are from the web or the testing set of corresponding modalities.
Figure 4: Additional Qualitative Image Demos.
Figure 5: Additional Qualitative Video, Audio and Point Cloud Demos.
...and 1 more figure
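
The Figure 2 caption describes a forward pass in which the input tokens and the modality tokens are concatenated, and only the modality-token positions are kept as a summary for the LLM. The sketch below illustrates just that bookkeeping. It is an assumption-laden toy: `mix_tokens` is a hypothetical placeholder for the projection layers (it blends each token with the sequence mean so that information can flow between positions), and `summarize` is an illustrative name, not a function from the OneLLM code.

```python
def mix_tokens(tokens):
    """Placeholder for the projection layers (assumption, not the real module).

    Each token is averaged with the mean of the whole sequence, so every
    output position carries some information from every input position.
    """
    dim = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return [[0.5 * t[i] + 0.5 * mean[i] for i in range(dim)] for t in tokens]


def summarize(input_tokens, modality_tokens):
    """Concatenate, mix, and keep only the modality-token positions."""
    if not modality_tokens:
        return []
    # Concatenate along the sequence axis: input tokens first, modality tokens last.
    sequence = input_tokens + modality_tokens
    mixed = mix_tokens(sequence)
    # The trailing positions correspond to the modality tokens.
    return mixed[-len(modality_tokens):]


input_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
modality_tokens = [[0.0, 0.0], [0.0, 0.0]]
summary = summarize(input_tokens, modality_tokens)
```

The summary length depends only on the number of modality tokens, not on the input length, so the LLM would receive a fixed-size sequence whether the input is a short or a long signal.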
