OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis

Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Xiaobo Xia, Hamid Alinejad-Rokny, Fei Huang

TL;DR

OpenOmni tackles open-source omnimodal learning by using a progressive, text-pivoted two-stage alignment (speech-text before image-text) to achieve near zero-shot cross-modal generalization without tri-modal data. It couples a lightweight end-to-end streaming speech decoder with direct preference optimization to deliver real-time, emotionally aware speech in bilingual settings. The approach achieves state-of-the-art results on OmniBench and vision-language/speech-language benchmarks while using far less data and a smaller model than prior open models, and it supports end-to-end generation with latency reductions of roughly 5× relative to autoregressive methods. These advances hold practical impact for real-time, expressive multimodal assistants in open research ecosystems, enabling broader reproducibility and community-driven innovation.

Abstract

Recent advancements in omnimodal learning have significantly improved understanding and generation across images, text, and speech, yet these developments remain predominantly confined to proprietary models. The lack of high-quality omnimodal datasets and the challenges of real-time emotional speech synthesis have notably hindered progress in open-source research. To address these limitations, we introduce OpenOmni, a two-stage training framework that integrates omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model undergoes further training on text-image tasks, enabling (near) zero-shot generalization from vision to speech, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder is trained on speech tasks with direct preference optimization, enabling real-time emotional speech synthesis with high fidelity. Experiments show that OpenOmni surpasses state-of-the-art models across omnimodal, vision-language, and speech-language benchmarks. It achieves a 4-point absolute improvement on OmniBench over the leading open-source model VITA, despite using 5x fewer training samples and a smaller model size (7B vs. 7x8B). Additionally, OpenOmni achieves real-time speech generation with <1s latency in non-autoregressive mode, reducing inference time by 5x compared to autoregressive methods, and improves emotion classification accuracy by 7.7%.

Paper Structure (19 sections, 5 equations, 5 figures, 9 tables)

This paper contains 19 sections, 5 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Overview of the motivation and architecture of OpenOmni. (a) OpenOmni adopts a progressive alignment strategy to generalize from vision-language to speech-language tasks, avoiding the need for costly tri-modal datasets and resources. (b) OpenOmni integrates a lightweight end-to-end speech decoder, enabling parallel text and speech generation while effectively reducing inference latency. (c) By utilizing DPO, OpenOmni generates emotionally coherent and context-aware speech without relying on additional control modules or handcrafted prompts. For simplicity, our core architecture is presented without the connectors between modules.
  • Figure 2: Overview of the training process of OpenOmni. To enable zero-shot omnimodal learning and real-time emotional speech generation, OpenOmni undergoes a progressive three-stage training process: (1) Speech-text alignment. A speech encoder extracts continuous speech and text features for alignment learning, equipping the large language model with speech understanding capabilities. (2) Image-text alignment. An image encoder extracts continuous image and text features, facilitating alignment learning that enhances OpenOmni's image comprehension and instruction-following abilities. This process also establishes implicit omnimodal alignment, which enables omni-understanding. (3) Text-guided speech generation. A lightweight speech decoder is trained using high-quality synthesized speech dialogue data, with a focus on direct preference optimization for emotional speech. This final stage allows OpenOmni to generate real-time and self-aware emotional speech. A text-guided module (TGM) is utilized to accelerate the training convergence.
  • Figure 3: The structure of our speech decoder. The speech decoder consists of a mixture of expert modules and multiple transformer layers, which achieves end-to-end speech unit learning through the connectionist temporal classification (CTC) loss.
  • Figure 4: Ablation study of the text-guided module (TGM). To examine the effect of TGM on speech generation in the two decoding modes, we plot the training loss under identical settings. TGM significantly speeds up training convergence and improves the generation quality of the speech decoder.
  • Figure 5: Overview of the text-guided module and the speech decoder modes. (Left) The text-guided module fuses the hidden state and the response's textual features via cross-attention, accelerating training convergence without slowing speech decoding or degrading contextual emotion perception. (Right) OpenOmni supports both autoregressive (AR) and non-autoregressive (NAR) speech generation. The NAR mode uses CTC loss and a speech vocabulary of 6K units to enable real-time parallel speech decoding. The AR mode uses next-token prediction (NTP) loss and a speech vocabulary of 16K units to support streaming decoding and higher-quality speech generation. To make the training of the speech generator more stable, we design a text-guided output feature fusion method to ensure correct semantic alignment in speech generation modeling.
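
The captions for Figures 3 and 5 describe a CTC-trained decoder whose NAR mode emits speech units in parallel. As a rough illustration of why this permits parallel decoding, the sketch below shows generic greedy CTC decoding: each frame's unit is chosen independently, consecutive repeats are merged, and blank symbols are dropped. This is a minimal sketch of the standard CTC collapse rule, not OpenOmni's actual decoder; the blank index `BLANK = 0`, the function name `ctc_greedy_decode`, and the toy 6-unit vocabulary are assumptions made for illustration only.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_scores):
    """Collapse per-frame unit scores into a speech-unit sequence.

    frame_scores: list of frames, each a list of scores over the unit vocabulary.
    """
    # Step 1: pick the best-scoring unit for every frame independently.
    # No frame depends on another, so this step can run in parallel.
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # Step 2: merge runs of consecutive identical predictions.
    merged = [unit for unit, _ in groupby(best)]
    # Step 3: drop blanks, which only separate genuine repeats.
    return [unit for unit in merged if unit != BLANK]


def one_hot(index, size=6):
    """Toy score vector with all mass on one unit (illustration only)."""
    return [1.0 if j == index else 0.0 for j in range(size)]


# Frame-level predictions: blank, 3, 3, blank, 3, 5, 5, blank
frames = [one_hot(i) for i in [0, 3, 3, 0, 3, 5, 5, 0]]
print(ctc_greedy_decode(frames))  # [3, 3, 5]
```

The blank between the two runs of unit 3 is what lets the output contain the same unit twice in a row. An AR decoder, by contrast, must generate each unit conditioned on the previous ones, which is consistent with the latency gap the abstract reports between the two modes.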