Spider: Any-to-Many Multimodal LLM

Jinxiang Lai, Jie Zhang, Jun Liu, Jian Li, Xiaocheng Lu, Song Guo

TL;DR

This paper introduces Spider, a novel and efficient Any-to-Many Modalities Generation (AMMG) framework that can generate an arbitrary combination of modalities 'Text + Xs', such as Text + {Image and Audio and Video}.

Abstract

Multimodal LLMs (MLLMs) have emerged as an extension of Large Language Models (LLMs), enabling the integration of various modalities. However, Any-to-Any MLLMs are limited to generating pairwise modalities 'Text + X' within a single response, such as Text + {Image or Audio or Video}. To address this limitation, we introduce Spider, a novel and efficient Any-to-Many Modalities Generation (AMMG) framework that can generate an arbitrary combination of modalities 'Text + Xs', such as Text + {Image and Audio and Video}. To achieve efficient AMMG, Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, an Any-to-Many Instruction Template designed for producing Xs signal prompts, and a novel Efficient Decoders-Controller for controlling multimodal Decoders to generate Xs (many-modal) contents. To train Spider, we constructed a novel Text-formatted Many-Modal (TMM) dataset, which facilitates learning the X-to-Xs (i.e., Any-to-Many) capability necessary for AMMG. Finally, the well-trained Spider is used to generate a pseudo X-to-Xs dataset, the first-ever X-to-Xs many-modal dataset, enhancing the potential for AMMG tasks in future research. Overall, this work not only pushes the boundary of multimodal interaction but also provides rich data support for advancing the field. Code: https://github.com/Layjins/Spider

Paper Structure

This paper contains 27 sections, 2 equations, 15 figures, 22 tables.

Figures (15)

  • Figure 1: (a) X-to-X (Any-to-Any) MLLMs support the input and output of pairwise modalities 'Text + X'. (b) Our X-to-Xs (Any-to-Many) Spider model produces many modalities 'Text + Xs'. X denotes any single modality, such as image, video, or audio, and Xs denotes an arbitrary combination of modalities, such as image and video and audio together. A 'Text + Xs' example is shown in (d). (c) The Spider structure comprises four parts: Encoders, LLM, Decoders-Controller, and Decoders. The LLM serves as the core, processing the multimodal input encoded by the Encoders for semantic understanding and reasoning. The LLM then generates not only the Text response but also the Text Prompt (T-Prompt) and Modality Prompt (M-Prompt) that the subsequent Decoders-Controller uses to control the multimodal Decoders. (d) With the Any-to-Many Instruction Template, T-Prompts and M-Prompts are gathered to form many-modal signal prompts, which control the Decoders to generate many-modal contents. An example of many-modal signal prompts is '<IMAGE>Forbidden City[IMAGE_0]</IMAGE>. <AUDIO>Peking Opera[AUDIO_0]</AUDIO>.', where '<IMAGE> ... </IMAGE>' and '<AUDIO> ... </AUDIO>' are the begin-end signal pairs of image and audio, respectively, 'Forbidden City' and 'Peking Opera' are T-Prompts, and '[IMAGE_0]' and '[AUDIO_0]' are M-Prompts. We call this Modality-wise Grouping: each modality signal prompt is grouped by its begin-end signal pair, which contains the T-Prompt and M-Prompt inside. This allows arbitrary concatenation of different modality signal prompts.
  • Figure 2: (a) The Efficient Decoders-Controller consists of a Unified Decoder Projector (UDP) and TM-Fusion (TMF), which enable the LLM to efficiently control multiple task Decoders to generate many-modal contents. $X$ marks a variable or function corresponding to a specific X-modality, such as image, audio, or video. The M-Prompt embedding $M_e^X = e(M^X)$ is the hidden embedding of $M^X$ extracted by the LLM. (b) The M-Alignment Loss and M-Reconstruction Loss for optimizing the Decoders-Controller. (c) The intuitive embedding space relationship corresponding to (b).
  • Figure 3: (a) Multiple Projectors (MP) for LLM-Decoders alignment. (b) Our Unified Decoder Projector (UDP).
  • Figure 4: Any-to-Many Instruction Template. (a) Input Question Format and Output Answer Format. (b) An example of a Question and Answer. Best viewed in color, with colors corresponding to (a). (c) TaskPrompt to distinguish different output modes. (d) M-Prompt $M^{X}$ to identify different output modalities, where $i$ is set to 0.
  • Figure 5: Examples of TMM dataset.
  • ...and 10 more figures
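
The Modality-wise Grouping format described in the Figure 1 caption can be illustrated with a small sketch. The code below is not taken from the Spider repository: the names `SignalPrompt`, `build_signal_prompts`, and `parse_signal_prompts` are hypothetical, and the regular expression assumes the textual form shown in the caption's example (a begin-end signal pair wrapping a T-Prompt followed by an M-Prompt such as `[IMAGE_0]`). It only shows how such prompts could be concatenated and split back into per-modality groups.

```python
import re
from typing import List, NamedTuple


class SignalPrompt(NamedTuple):
    """One modality group (hypothetical helper type, not from the paper)."""
    modality: str  # e.g. "IMAGE" or "AUDIO"
    t_prompt: str  # T-Prompt: text describing the content to generate
    index: int     # index i inside the M-Prompt [MODALITY_i]


# Assumed format: <MOD>T-Prompt[MOD_i]</MOD>
# The backreference \1 forces the begin tag, M-Prompt, and end tag to agree.
_GROUP = re.compile(r"<([A-Z]+)>(.*?)\[\1_(\d+)\]\s*</\1>", re.DOTALL)


def build_signal_prompts(prompts: List[SignalPrompt]) -> str:
    """Concatenate modality groups into one many-modal signal prompt string."""
    return " ".join(
        f"<{p.modality}>{p.t_prompt}[{p.modality}_{p.index}]</{p.modality}>."
        for p in prompts
    )


def parse_signal_prompts(text: str) -> List[SignalPrompt]:
    """Split a many-modal signal prompt string back into its modality groups."""
    return [
        SignalPrompt(m.group(1), m.group(2).strip(), int(m.group(3)))
        for m in _GROUP.finditer(text)
    ]


if __name__ == "__main__":
    example = build_signal_prompts([
        SignalPrompt("IMAGE", "Forbidden City", 0),
        SignalPrompt("AUDIO", "Peking Opera", 0),
    ])
    print(example)
    for group in parse_signal_prompts(example):
        print(group)
```

Because each group carries its own begin-end pair, the groups can be joined in any order and any number, which is the property the caption refers to as arbitrary concatenation. In the actual model, each parsed group would be routed to the Decoder for its modality. That routing is outside the scope of this sketch.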