OmniForcing: Unleashing Real-time Joint Audio-Visual Generation

Yaofeng Su, Yuming Li, Zeyue Xue, Jie Huang, Siming Fu, Haoran Li, Ying Li, Zezhong Qian, Haoyang Huang, Nan Duan

Abstract

Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at $\sim$25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher.

Project Page: https://omniforcing.com

Paper Structure (12 sections, 7 equations, 4 figures, 3 tables)


Figures (4)

  • Figure 1: OmniForcing breaks the latency barrier for joint audio-visual generation. Top: Our framework achieves real-time streaming at $\sim$25 FPS with an ultra-low Time-To-First-Chunk (TTFC) of $\sim$0.7s. Bottom: The bidirectional teacher (LTX-2) requires $\sim$197s to generate the sequence offline. OmniForcing maintains visual and acoustic fidelity on par with the teacher model.
  • Figure 2: The three-stage OmniForcing distillation pipeline. Stage I employs Distribution Matching Distillation (DMD) [yin2024one, yin2024improved] to adapt the model for few-step, fast denoising. Stage II utilizes causal ODE regression to adapt the network weights to the asymmetric block-causal mask. Stage III implements joint Self-Forcing [huang2025self] training by autoregressively unrolling the generation process to mitigate exposure bias.
  • Figure 3: Asymmetric Block-Causal Masking. The vertical axis denotes query tokens and the horizontal axis denotes key tokens. Modalities are synchronized via 1s macro-blocks. Each audio block ($B^a$) contains 25 latent frames (one token each), whereas each video block ($B^v$) contains 3 latent frames patchified into $3 \times 384$ tokens. Unmasked tokens include the Global Prefix (orange, $V_0/A_0$) and Audio Sink tokens (red, $s$). Blue regions denote allowed attention (bidirectional intra-block, strictly causal inter-block), while white regions mask future keys to prevent information leakage.
  • Figure 4: Qualitative comparison across diverse scenes. Each example shows generated frames with synchronized waveforms and Mel spectrograms. OmniForcing produces voice layered with bird calls at a seaside scene (top-left), sustained speech at a podium presentation (top-right), a precisely timed cat meow (bottom-left), and blended narration with sewing machine sounds (bottom-right).
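
The masking rule described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The per-block token counts (25 audio tokens, $3 \times 384$ video tokens) come from the caption; everything else is an assumption made for illustration: the helper names `block_ids` and `block_causal_mask`, the number of always-visible tokens per stream (384 for the video prefix, 2 for audio prefix plus sink), the choice that always-visible queries attend only to other always-visible keys, and the use of one rule for both self- and cross-modal attention.

```python
from typing import List

VIDEO_TOKENS_PER_BLOCK = 3 * 384  # 3 latent frames x 384 tokens each (Figure 3)
AUDIO_TOKENS_PER_BLOCK = 25       # 25 latent frames, one token each (Figure 3)
ALWAYS_VISIBLE = -1               # hypothetical block id for prefix / sink tokens


def block_ids(num_always_visible: int, tokens_per_block: int, num_blocks: int) -> List[int]:
    """Assign every token of one stream the index of its 1s macro-block."""
    ids = [ALWAYS_VISIBLE] * num_always_visible
    for b in range(num_blocks):
        ids.extend([b] * tokens_per_block)
    return ids


def block_causal_mask(query_ids: List[int], key_ids: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query q may attend to key k.

    Keys in the same macro-block are visible (bidirectional intra-block),
    keys in earlier blocks are visible (causal inter-block), keys in later
    blocks are masked. Prefix / sink keys carry id -1 and are never masked.
    """
    return [[k <= q for k in key_ids] for q in query_ids]


# Two 1s macro-blocks; the always-visible token counts are assumed values.
video_ids = block_ids(num_always_visible=384, tokens_per_block=VIDEO_TOKENS_PER_BLOCK, num_blocks=2)
audio_ids = block_ids(num_always_visible=2, tokens_per_block=AUDIO_TOKENS_PER_BLOCK, num_blocks=2)

audio_self = block_causal_mask(audio_ids, audio_ids)     # audio queries, audio keys
audio_to_video = block_causal_mask(audio_ids, video_ids)  # audio queries, video keys
```

Because both streams index their tokens by the same 1s macro-block, a single comparison handles the asymmetric token counts: an audio query in block 0 sees all 1152 video tokens of block 0 but none of block 1, even though its own block holds only 25 tokens.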