OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion
Sai Koneru, Matthias Huck, Jan Niehues
TL;DR
OmniFusion is an end-to-end architecture that fuses a pretrained multimodal foundation model (MMFM) with a multilingual translation LLM via a gated, multi-layer fusion mechanism. By incorporating hidden states from the MMFM's first, middle, and last layers and aligning them through OCR-driven prompts, the model handles speech-only, speech-with-image, and text-with-image translation in a single system, while reducing simultaneous speech translation (SimulST) latency by about 1 second compared to cascaded baselines. The approach achieves strong multimodal translation performance, including state-of-the-art results on CoMMuTE for image-text translation, and reveals that early MMFM layers contribute most to cross-modal representations. Layer analyses and ablations provide practical guidance for efficient cross-modal fusion, while limitations point to future work on broader modalities and instruction-driven multimodal behavior.
Abstract
There has been significant progress in open-source text-only translation large language models (LLMs) with better language coverage and quality. However, these models can only be used in cascaded pipelines for speech translation (ST), which perform automatic speech recognition first and translation second. This introduces additional latency, which is particularly critical in simultaneous ST (SimulST), and prevents the model from exploiting multimodal context, such as images, which can aid disambiguation. Pretrained multimodal foundation models (MMFMs) already possess strong perception and reasoning capabilities across multiple modalities, but generally lack the multilingual coverage and specialized translation performance of dedicated translation LLMs. To build an effective multimodal translation system, we propose an end-to-end approach that fuses MMFMs with translation LLMs. We introduce a novel fusion strategy that connects hidden states from multiple layers of a pretrained MMFM to a translation LLM, enabling joint end-to-end training. The resulting model, OmniFusion, built on Omni 2.5-7B as the MMFM and SeedX PPO-7B as the translation LLM, can perform speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation. Experiments demonstrate that OmniFusion effectively leverages both audio and visual inputs, achieves a 1-second latency reduction in SimulST compared to cascaded pipelines, and also improves overall translation quality. Code is available at https://github.com/saikoneru/OmniFusion.
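
To make the idea of gated, multi-layer fusion more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that hidden states are taken from the first, middle, and last MMFM layers, that each is linearly projected to the translation LLM's embedding width, and that an elementwise sigmoid gate scales each projection before summation. The gate form, the function name `gated_multilayer_fusion`, and all dimensions are hypothetical choices made for illustration; this section does not specify the exact formulation.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_multilayer_fusion(layer_states, proj_weights, gate_weights):
    """Fuse hidden states from several MMFM layers into LLM-sized vectors.

    layer_states: list of arrays, each of shape (seq_len, d_mmfm)
    proj_weights: list of arrays, each of shape (d_mmfm, d_llm)
    gate_weights: list of arrays, each of shape (d_mmfm, d_llm)
    Returns an array of shape (seq_len, d_llm).

    Assumption: a sigmoid gate per layer; the paper's gate may differ.
    """
    seq_len = layer_states[0].shape[0]
    d_llm = proj_weights[0].shape[1]
    fused = np.zeros((seq_len, d_llm))
    for h, w_proj, w_gate in zip(layer_states, proj_weights, gate_weights):
        projected = h @ w_proj          # map the layer into the LLM embedding space
        gate = sigmoid(h @ w_gate)      # values in (0, 1), one per output feature
        fused += gate * projected       # gated contribution of this layer
    return fused


# Toy dimensions (hypothetical, far smaller than a real 7B model).
rng = np.random.default_rng(0)
num_layers, seq_len, d_mmfm, d_llm = 12, 5, 8, 6

# Stand-in for the per-layer hidden states of a pretrained MMFM.
all_hidden = rng.standard_normal((num_layers, seq_len, d_mmfm))

# Select the first, middle, and last layers.
selected = [0, num_layers // 2, num_layers - 1]
layer_states = [all_hidden[i] for i in selected]

proj_weights = [rng.standard_normal((d_mmfm, d_llm)) for _ in selected]
gate_weights = [rng.standard_normal((d_mmfm, d_llm)) for _ in selected]

fused = gated_multilayer_fusion(layer_states, proj_weights, gate_weights)
print(fused.shape)
```

In a trained system the projection and gate weights would be learned jointly with the translation LLM, and the fused sequence would be supplied to the LLM as input embeddings. Here the weights are random and serve only to show the shapes and data flow.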
