Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound

Jiahua Wang, Shannan Yan, Leqi Zheng, Jialong Wu, Yaoxin Mao

TL;DR

The paper addresses the absence of a formal multisensory world model by introducing Audio-Visual World Models (AVWM), a framework based on a partially observable Markov decision process (POMDP) that jointly models synchronized visual and binaural audio dynamics under fine-grained action control, together with task reward prediction. It contributes AVW-4k, a dataset of 30 hours of binaural audio-visual trajectories with action and reward annotations across 76 indoor environments, and AV-CDiT, a conditional diffusion transformer with modality experts and a three-stage training strategy that balances visual and auditory learning. Experiments show high-fidelity prediction of both modalities and of reward, and the trained AVWM improves agent performance on continuous audio-visual navigation, including when it is used for planning.

Abstract

World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains largely unexplored. No prior work has formally defined what constitutes an audio-visual world model or how to jointly capture binaural spatial audio and visual dynamics under precise action control with task reward prediction. This work presents the first formal framework for Audio-Visual World Models (AVWM), formulating multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations, fine-grained actions, and task rewards. To address the lack of suitable training data, we construct AVW-4k, a dataset comprising 30 hours of binaural audio-visual trajectories with action annotations and reward signals across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities with reward. Furthermore, we validate its practical utility in continuous audio-visual navigation tasks, where AVWM significantly enhances the agent's performance.
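
To make the formulation in the abstract concrete, the following is a minimal sketch of how a world model that predicts the next audio-visual observation and a task reward can be used for planning. It is not AV-CDiT and not the paper's planner: the `Observation` type, `toy_world_model`, and `plan` are all hypothetical names, the dynamics are a hand-written one-dimensional toy standing in for a learned model, and exhaustive search over short action sequences is swapped in purely for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Tuple


@dataclass
class Observation:
    """Hypothetical stand-in for one synchronized audio-visual observation."""
    frame: List[float]             # placeholder for an image; here just the agent position
    binaural: Tuple[float, float]  # placeholder for (left, right) audio energy


# A world model maps (observation, action) to (next observation, predicted reward).
WorldModel = Callable[[Observation, int], Tuple[Observation, float]]

SOURCE_POSITION = 5.0  # assumed location of the sound source in the toy world


def toy_world_model(obs: Observation, action: int) -> Tuple[Observation, float]:
    """Toy dynamics: action 0 stays, 1 moves +1, 2 moves -1 along a line."""
    step = {0: 0.0, 1: 1.0, 2: -1.0}[action]
    position = obs.frame[0] + step
    distance = abs(SOURCE_POSITION - position)
    gain = 1.0 / (1.0 + distance)  # sound gets louder near the source
    # Crude interaural difference: the ear nearer the source hears more.
    if position < SOURCE_POSITION:
        binaural = (0.5 * gain, gain)
    else:
        binaural = (gain, 0.5 * gain)
    reward = -distance  # assumed task reward: closer to the source is better
    return Observation([position], binaural), reward


def plan(model: WorldModel, obs: Observation, horizon: int = 4,
         num_actions: int = 3) -> Tuple[List[int], float]:
    """Imagine every action sequence with the model and keep the best one."""
    best_sequence: List[int] = []
    best_return = float("-inf")
    for candidate in product(range(num_actions), repeat=horizon):
        current, total = obs, 0.0
        for action in candidate:
            current, reward = model(current, action)
            total += reward
        if total > best_return:
            best_return, best_sequence = total, list(candidate)
    return best_sequence, best_return


if __name__ == "__main__":
    start = Observation([0.0], (0.0, 0.0))
    print(plan(toy_world_model, start))
```

In this toy setting the planner picks the sequence that moves steadily toward the sound source, because the imagined rewards are highest along that rollout. A learned model would replace `toy_world_model`, and a sampling-based search would normally replace the exhaustive loop once the action space or horizon grows.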

Paper Structure

This paper contains 13 sections, 5 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: From unimodal to audio-visual world models. While embodied agents in the real world perceive through multiple sensory modalities including vision and audio, existing world models predominantly focus on visual observations alone. Our work introduces Audio-Visual World Models (AVWMs), the first framework to jointly simulate synchronized audio-visual dynamics under precise action control with task reward prediction.
  • Figure 2: Dataset statistics and trajectory examples from the proposed AVW-4k. Left: Proportions of trajectories for each of the three motion patterns in the train/validation/test splits. Middle: Distributions of trajectory lengths and of geodesic distances to the sound source across all frames in the dataset. Right: Representative trajectories for each motion pattern, with audio shown as binaural spectrograms.
  • Figure 3: Overview of the proposed AV-CDiT architecture.
  • Figure 4: Illustration of the stagewise training strategy used for AV-CDiT.
  • Figure 5: Qualitative analysis. Left and right respectively show image and audio generation results of our model and two ablated variants under the fixed-step and rollout modes.
  • ...and 2 more figures