ModaVerse: Efficiently Transforming Modalities with LLMs

Xinyu Wang, Bohan Zhuang, Qi Wu

TL;DR

ModaVerse addresses the challenge of transforming and interpreting data across multiple modalities by coupling input adaptors with an LLM-as-agent. It introduces I/O Alignment, a language-level mechanism that aligns multimodal inputs to the LLM and directs the LLM to invoke external text-to-x generators, enabling a single-stage, data-efficient training regime. The approach achieves competitive results on diverse benchmarks using roughly 40M trainable parameters, while requiring far less data and training time than several prior MLLMs. This work offers a practical path toward scalable, plug-and-play multi-modal reasoning and generation with reduced computational overhead.

Abstract

Humans possess the capability to comprehend diverse modalities and seamlessly transfer information between them. In this work, we introduce ModaVerse, a Multi-modal Large Language Model (MLLM) capable of comprehending and transforming content across various modalities including images, videos, and audio. Predominant MLLM frameworks have largely relied on the alignment of latent spaces of textual and non-textual features. This alignment process, which synchronizes a language model trained on textual data with encoders and decoders trained on multi-modal data, often necessitates extensive training of several projection layers in multiple stages. Inspired by LLM-as-agent methodologies, we propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language. It aligns the LLM's output with the input of generative models, avoiding the complexities associated with latent feature alignments, and simplifying the multiple training stages of existing MLLMs into a single, efficient process. This conceptual advancement leads to significant reductions in both data and computational costs. By conducting experiments on several benchmarks, we demonstrate that our approach attains comparable performance with the state of the art while achieving considerable efficiencies in data usage and training duration.

Paper Structure (11 sections, 1 equation, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Comparative illustration of MLLM paradigms: (a) Multi-modal Pre-training, where new modules such as vision encoders and decoders are integrated within the standard LLM framework. (b) Adaptor Training, illustrating the use of projection layers to connect LLMs to pre-existing modules. (c) LLM as an Agent, highlighting the strategic application of prompts in conjunction with external tools. (d) Adaptor+Agent (ours), transforming modalities with efficient language-based Input/Output (I/O) alignment. E, D, and L represent the Encoder, Decoder, and Linear Layer respectively. T-to-x denotes a text-to-x generative model, where x can be Image, Video, and Audio.
  • Figure 2: Comparison of the overview schematics of recently proposed MLLMs. L represents linear projection layers.
  • Figure 3: Overview of the proposed ModaVerse pipeline. In the input projection stage, multi-modal inputs $I'$ are aligned to the LLM's space $O_{1}$ using a series of trainable linear layers. During the meta-response generation stage, the LLM is fine-tuned with a LoRA adaptor, prompting the generation of a meta-response $O_{2}$. In the final response generation stage, additional pretrained text-to-x models are used to generate the final multi-modal response $O'$ based on the parsed meta-response.
  • Figure 4: Qualitative examples of the proposed ModaVerse interpreting and producing data presented in combinations of various modalities. Blue and Red dashed boxes represent input and output respectively.
  • Figure 5: Failure cases of ModaVerse. (a) The model can only generate entirely new images and cannot modify the original pixels. (b) The model tends to generate irrelevant outputs in the absence of language instructions during the input phase.
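
The final response generation stage described in the Figure 3 caption (parse the meta-response, then call a text-to-x model) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the JSON layout of the meta-response, the field names `modality` and `prompt`, and the placeholder generator functions are hypothetical and do not come from the paper, which would use real pretrained text-to-image, text-to-video, and text-to-audio models.

```python
import json
from typing import Callable, Dict

# ASSUMPTION: the meta-response is a JSON object with a target modality and a
# text prompt. The actual meta-response format is not given in this summary.


def parse_meta_response(meta_response: str) -> Dict[str, str]:
    """Extract the target modality and the text prompt from a meta-response."""
    data = json.loads(meta_response)
    return {
        "modality": data.get("modality", "text"),
        "prompt": data.get("prompt", ""),
    }


# Hypothetical stand-ins for pretrained text-to-x generators.
def fake_text_to_image(prompt: str) -> str:
    return f"<image generated from: {prompt}>"


def fake_text_to_video(prompt: str) -> str:
    return f"<video generated from: {prompt}>"


def fake_text_to_audio(prompt: str) -> str:
    return f"<audio generated from: {prompt}>"


GENERATORS: Dict[str, Callable[[str], str]] = {
    "image": fake_text_to_image,
    "video": fake_text_to_video,
    "audio": fake_text_to_audio,
}


def generate_final_response(
    meta_response: str,
    generators: Dict[str, Callable[[str], str]] = GENERATORS,
) -> str:
    """Route the parsed meta-response to the matching text-to-x generator."""
    parsed = parse_meta_response(meta_response)
    modality = parsed["modality"]
    if modality == "text":
        # Plain-text answers need no external generator.
        return parsed["prompt"]
    if modality not in generators:
        raise ValueError(f"No generator registered for modality: {modality}")
    return generators[modality](parsed["prompt"])


if __name__ == "__main__":
    print(generate_final_response('{"modality": "image", "prompt": "a red fox"}'))
```

Because the hand-off between the LLM and the generators happens in plain text, swapping one generator for another in this sketch only means changing an entry in the `GENERATORS` dictionary, which is the plug-and-play property the TL;DR refers to.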