Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji

TL;DR

This work tackles the scaling challenge in multimodal-to-audio generation by examining whether models trained on short instances can generalize to longer ones at test time. It presents a multimodal hierarchical network, MMHNet, an enhanced extension of state-of-the-art video-to-audio models.

Abstract

Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To this end, we present a multimodal hierarchical network, MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation, and it significantly improves audio generation for durations of more than 5 minutes. We also show that training on short durations and testing on long ones is possible in video-to-audio generation, without any training on the longer durations. In our experiments, the proposed method achieves strong results on long-video-to-audio benchmarks and outperforms prior work on video-to-audio tasks. It also generates audio longer than 5 minutes, a setting in which prior video-to-audio methods fall short.

Paper Structure (16 sections, 9 equations, 6 figures, 9 tables)


Figures (6)

  • Figure 1: Long-Video to Audio (LV2A) task overview. The challenge is framed as training models on fixed-length segments while requiring them to generalize to variable-length (long-form) audio outputs during inference.
  • Figure 2: We analyze the role of positional embeddings in V2A models such as MMAudio [cheng2025mmaudio], built on MMDiT [flux2024]. (a) Without positional embeddings, MMAudio fails to capture temporal structure, producing redundant audio dominated by prominent visual objects (e.g., a car crashing). (b) With adjusted positional embeddings, alignment improves but sound quality degrades over long sequences (see scene C). (c) On UnAV100 [geng2023unav100], both configurations show performance drops across durations, with MMAudio without positional embeddings performing worst in distribution matching (FD$_\textrm{PANNs}$ $\downarrow$) and multimodal alignment (IB-Score $\uparrow$).
  • Figure 3: Overview of our proposed framework. Left: A comprehensive end-to-end flow-matching model that operates across both multimodal and single-modal blocks, handling inputs in both compressed and original spaces. Middle: A temporal routing mechanism designed to efficiently process tokens in a time-aware manner. Right: A multimodal routing strategy that leverages strong correlations between the two modalities for enhanced integration.
  • Figure 4: Visualization of audio spectrograms from MMHNet and competing methods on UnAV100.
  • Figure 5: Comparison with prior methods on various duration splits of audio-video data on UnAV100 (FD$_\textrm{PANNs}$ $\downarrow$ and IB-Score $\uparrow$).
  • ...and 1 more figures