Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng

TL;DR

Stream-Omni addresses the challenge of building a unified large language–vision–speech model that can interact across text, vision, and speech with limited tri-modal data. It achieves data-efficient modality alignment by applying sequence-dimension fusion for vision–text and layer-dimension mapping for speech–text, enabled by bottom/top speech layers and a CTC-based mapping that transfers text capabilities to speech. The approach yields strong performance on visual understanding, speech interaction, and vision-grounded speech tasks, using only about 23k hours of speech data, and enables intermediate text outputs during speech interactions for richer multimodal experiences. This work advances practical omni-modal AI by enabling seamless see-while-hear interactions with flexible modality combinations and efficient training.

Abstract

The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate the representations of different modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is a straightforward way to integrate modalities, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns vision and speech to text based on their relationships to it. For vision, which is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech, which is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech data), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
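
The two alignment strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the blank id `BLANK = 0`, and the toy token ids are all hypothetical. The first function shows sequence-dimension concatenation as plain joining of embedding sequences, and the second shows the standard CTC greedy collapse rule (merge consecutive repeats, then drop blanks), which is how frame-level speech predictions are generally reduced to a shorter text-length sequence.

```python
from typing import List, Sequence

BLANK = 0  # hypothetical id of the CTC blank symbol (assumption, not from the paper)


def concat_along_sequence(
    vision: List[List[float]], text: List[List[float]]
) -> List[List[float]]:
    """Sequence-dimension concatenation: vision embeddings followed by text embeddings."""
    return vision + text


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Collapse frame-level CTC predictions: merge consecutive repeats, then drop blanks."""
    collapsed: List[int] = []
    prev = None
    for tok in frame_ids:
        # Keep a token only if it differs from the previous frame and is not the blank.
        if tok != prev and tok != blank:
            collapsed.append(tok)
        prev = tok
    return collapsed


if __name__ == "__main__":
    # Three toy vision embeddings and two toy text embeddings (2-dimensional each).
    vision = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    text = [[0.7, 0.8], [0.9, 1.0]]
    print(len(concat_along_sequence(vision, text)))

    # Eight speech frames; a blank between the two 5s keeps them as separate tokens.
    print(ctc_greedy_collapse([0, 5, 5, 0, 5, 7, 7, 0]))
```

The sketch highlights the contrast the abstract draws: concatenation lengthens the input sequence and leaves alignment to be learned from data, whereas a CTC-style mapping explicitly ties many speech frames to a few text tokens.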

Paper Structure

This paper contains 24 sections, 7 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of modality alignments in Stream-Omni and previous works.
  • Figure 2: Architecture of Stream-Omni. Right: Interactions under various modality combinations.
  • Figure 3: Diagram of top speech layers.
  • Figure 4: Case Study of Stream-Omni (detail understanding).
  • Figure 5: Case Study of Stream-Omni (long response).