Learning Robot Manipulation from Audio World Models

Fan Zhang, Michael Gienger

TL;DR

The paper addresses robotic manipulation in settings where visual cues are ambiguous by introducing an audio-focused world model. It applies latent flow matching in an AudioMAE-derived spectrogram latent space, using a transformer to estimate the vector field that generates future audio latents; the predicted audio then conditions a flow-matching robot policy. Two tasks (real-world water filling and simulated piano playing) show that incorporating predicted future audio improves performance over baselines without lookahead, highlighting the importance of rhythmic and pitch dynamics. The approach is modular and computationally efficient, enabling fast closed-loop prediction and flexible component substitution for broader multimodal robotic applications.

Abstract

World models have demonstrated impressive performance on robotic learning tasks. Many such tasks inherently demand multimodal reasoning; for example, when filling a bottle with water, visual information alone is ambiguous or incomplete, so the task requires reasoning over the temporal evolution of audio, accounting for its underlying physical properties and pitch patterns. In this paper, we propose a generative latent flow matching model to anticipate future audio observations, enabling the system to reason about long-term consequences when integrated into a robot policy. We demonstrate the superior capabilities of our system, compared to methods without future lookahead, on two manipulation tasks that require perceiving in-the-wild audio or music signals. We further emphasize that successful robot action learning for these tasks relies not merely on multimodal input, but critically on the accurate prediction of future audio states that embody intrinsic rhythmic patterns.

Paper Structure

This paper contains 6 sections, 4 equations, 2 figures.

Figures (2)

  • Figure 1: Overview of the proposed method. The source audio is first encoded into a latent representation. Given the current audio segments, a flow-matching transformer estimates the generating vector field from noisy audio latents. This vector field is then used to solve the corresponding ODE, producing the future audio latents. The resulting sequence of future audio latents is decoded into audio spectrograms. Finally, a robot policy is trained using both the current and predicted future audio spectrograms along with image observations.
  • Figure 2: Experimental results. We show the ground truth and the world-model generation for the water-filling spectrogram, the music spectrogram, and the MIDI data, respectively. The water-filling spectrogram is predicted in a closed-loop manner during robot evaluation. Music pieces are generated autoregressively, each conditioned on the previously generated pieces.
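
The sampling step described in the Figure 1 caption (start from a noisy latent, follow the estimated vector field by solving an ODE, obtain the future audio latent) and the autoregressive generation mentioned in the Figure 2 caption can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `toy_vector_field`, `predict_future_latent`, and `rollout` are hypothetical, the analytic toy field stands in for the learned flow-matching transformer, the fixed-step Euler solver is only one possible ODE solver (the captions do not say which one is used), and the latent shape `(4, 8)` is arbitrary.

```python
import numpy as np


def toy_vector_field(x, t, context):
    """Hypothetical stand-in for the learned vector field.

    For a straight-line path from noise to a target latent, the velocity
    at time t is (target - x) / (1 - t). The "target" is an assumed toy
    function of the context, not anything taken from the paper.
    """
    target = 0.5 * context
    return (target - x) / (1.0 - t)


def predict_future_latent(context, vector_field, num_steps=10, rng=None):
    """Integrate dx/dt = v(x, t, context) from t=0 to t=1 with Euler steps."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.standard_normal(context.shape)  # noisy starting latent
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field is finite
        x = x + dt * vector_field(x, t, context)
    return x


def rollout(context, vector_field, num_segments=3, num_steps=10):
    """Autoregressive generation: each prediction becomes the next context."""
    segments = []
    for _ in range(num_segments):
        context = predict_future_latent(context, vector_field, num_steps)
        segments.append(context)
    return np.stack(segments)


current = np.ones((4, 8))  # assumed shape for the current audio latent
future = predict_future_latent(current, toy_vector_field)
pieces = rollout(current, toy_vector_field)
```

In a real system the toy field would be replaced by the trained transformer, and the resulting latents would be decoded into spectrograms before being passed to the robot policy, as the Figure 1 caption describes.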