Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Yu Lu, Shilin Zhou, Ziliang Gan, Ziao Wang, Haipang Wu, Ji Liu, André Freitas, Qifan Wang, Zenglin Xu, Rongjuncheng Zhang, Yong Dai

TL;DR

Nexus addresses the challenges of tri-modal large language models by starting from a vision-language backbone and adding lightweight audio-language alignment, coupled with an audio data synthesis pipeline to curb data and compute demands. The authors propose a modular architecture (encoder–LLM–decoder) with dedicated encoders for visual and audio streams and a shared LLM backbone, enabling outputs in the language or audio modality. Across vision, Spoken QA, ASR, S2TT, and TTS tasks, Nexus performs competitively with or better than the baselines, and the analysis shows that audio input can enhance vision-language alignment in latent space. The work highlights practical implications for deploying omni-modal AI at industry scale with reduced pretraining requirements and improved robustness, and outlines future integration with vision generative models and broader AIGC capabilities.

Abstract

This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignments. Our pipeline consists of three main components: first, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures; second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL, thus avoiding the costly pre-training of vision-specific modalities; third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as Automatic Speech Recognition and Speech-to-Speech chat. To this end, we introduce an industry-level omni-modal LLM, Nexus. Extensive experiments validate the efficacy of our pipeline, yielding the following key findings: (1) In the visual understanding task, Nexus exhibits superior performance compared with its backbone model, Qwen2.5-VL-7B, validating the efficiency of our training strategy. (2) In the English Spoken Question-Answering task, the model achieves better accuracy than the same-period competitor (i.e., MiniCPM-o2.6-7B) on the LLaMA Q. benchmark. (3) On our real-world ASR test set, Nexus achieves outstanding performance, indicating its robustness in real scenarios. (4) In the Speech-to-Text Translation task, our model outperforms Qwen2-Audio-Instruct-7B. (5) In the Text-to-Speech task, based on a pretrained vocoder (e.g., Fishspeech1.4 or CosyVoice2.0), Nexus is comparable to its backbone vocoder on the Seed-TTS benchmark. (6) An in-depth analysis of tri-modal alignment reveals that incorporating the audio modality enhances representational alignment between vision and language.

Paper Structure

This paper contains 28 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Modularised architecture, designed to accept any combination of input modalities and to generate output in either the language or the audio modality. The Auto-Regressive (AR) audio decoder takes a special start-token embedding and the last language embedding as input and generates hierarchical discrete audio codes in an auto-regressive manner. These codes are subsequently fed into a pretrained audio generator to produce the final audio output.
  • Figure 2: Audio dataset synthesis pipeline. The upper component is the text-to-audio branch, and the lower component is the audio-to-text branch. Both components incorporate an equal proportion of in-house, real-world samples.
  • Figure 3: Averaged kernel-alignment score across different hidden layers. vision+audio: both visual and auditory modalities are fed into the model concurrently. The final fused representation is the last token in the sequence, taken as a summary of the integrated modalities. Incorporating the audio modality (green bar) yields better vision-language alignment at most hidden layers.
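
The layer-wise alignment analysis in Figure 3 can be illustrated with a small sketch. The caption does not say which kernel-alignment metric the authors use, so the code below swaps in linear centered kernel alignment (CKA), one common choice, purely as an assumption. The function names (`linear_cka`, `layerwise_alignment`) and the input layout (one list of per-sample feature vectors per hidden layer) are hypothetical and are not taken from the paper.

```python
import math


def center_columns(rows):
    """Subtract the per-feature mean so every column has zero mean."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]


def gram(rows):
    """Linear-kernel Gram matrix: inner products between all sample pairs."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def frobenius_inner(a, b):
    """Sum of element-wise products of two equally shaped matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(x, y):
    """Linear CKA between two sets of features for the same n samples.

    x and y are lists of n feature vectors; the feature widths may differ.
    Assumption: this stands in for the unspecified metric behind Figure 3.
    """
    if len(x) != len(y):
        raise ValueError("x and y must describe the same samples")
    k = gram(center_columns(x))
    l = gram(center_columns(y))
    denom = math.sqrt(frobenius_inner(k, k) * frobenius_inner(l, l))
    if denom == 0.0:
        return 0.0  # constant features carry no alignment signal
    return frobenius_inner(k, l) / denom


def layerwise_alignment(vision_layers, text_layers):
    """One alignment score per hidden layer (hypothetical input layout)."""
    return [linear_cka(v, t) for v, t in zip(vision_layers, text_layers)]
```

Under these assumptions, each hidden layer would contribute one feature vector per sample (for example, the last-token representation mentioned in the caption), and `layerwise_alignment` would be called once for the vision-only input and once for the vision+audio input so the two score lists can be compared layer by layer.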