Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation

Chang Nie, Tianchen Deng, Guangming Wang, Zhe Liu, Hesheng Wang

Abstract

While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low-frequency updates or system latency. This problem is exacerbated by action chunking with open-loop execution, which creates a Blind Execution Interval where acoustic events are lost between discrete audio observation windows. Recognizing the necessity of continuous auditory awareness, we formalize Vision-Sound-Language-Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we introduce HEAR, a VSLA framework integrating four components: (i) a streaming Historizer to maintain a compact, causal audio context across execution gaps; (ii) an Envisioner adapted from omni foundation models to reason over multi-sensory inputs; (iii) an Advancer, formulated as an audio world model, to learn temporal dynamics by predicting near-future audio codes; and (iv) a flow-matching Realizer policy to generate smooth action chunks. To address the scarcity of pretraining data and evaluations for VSLA, we construct OpenX-Sound for pretraining, alongside HEAR-Bench, the first sound-centric manipulation benchmark with strict causal timing rules. Our results suggest that robust sound-centric manipulation necessitates causal persistence and explicit temporal learning. This framework provides a practical step toward multi-sensory foundation models for embodied agents, enabling robots to perceive and interact with dynamic environments. Code and videos are available at https://hear.irmv.top.
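
To make the Blind Execution Interval concrete, the following minimal Python sketch compares which acoustic events reach the policy under two observation schemes. All names (`windowed_hits`, `streaming_hits`), the millisecond timings, and the window length are illustrative assumptions, not values from the paper. The streaming function is only a bookkeeping stand-in for a causal memory such as the Historizer, not its implementation.

```python
def windowed_hits(event_times, decision_times, window):
    """Events inside a fixed look-back audio window ending at some decision time.

    Anything outside every window falls into a blind interval and is lost.
    """
    return [e for e in event_times
            if any(d - window <= e <= d for d in decision_times)]


def streaming_hits(event_times, decision_times):
    """Events visible to a memory updated on every audio packet.

    Every event at or before the last decision is available to the policy.
    """
    last_decision = max(decision_times)
    return [e for e in event_times if e <= last_decision]


# Assumed timings in milliseconds: one decision per 1000 ms action chunk,
# and a 200 ms audio window observed just before each decision.
decisions = [0, 1000, 2000]
events = [150, 950, 1400, 2100]

seen_windowed = windowed_hits(events, decisions, window=200)
seen_streaming = streaming_hits(events, decisions)
```

With these assumed timings, the windowed scheme only catches the event at 950 ms, while the events at 150 ms and 1400 ms occur mid-chunk and are dropped. The streaming scheme retains all three events that precede the final decision.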

Paper Structure

This paper contains 87 sections, 23 equations, 19 figures, and 8 tables.

Figures (19)

  • Figure 1: The HEAR framework for Vision-Sound-Language-Action (VSLA) manipulation. Challenge: Upgrading standard VLA to the VSLA paradigm. We highlight a critical challenge: VLA models frequently miss transient acoustic cues due to the Blind Execution Interval (BEI), a structural blind spot caused by system latency and open-loop action chunking. Scenarios: Everyday sound-centric manipulation tasks require robots to perceive diverse acoustic cues, including speech, trigger, continuous, and interactive sounds. Solution: To address the BEI and handle complex sounds, the HEAR framework integrates four components: a causal audio memory (Historizer), an omni-sensory reasoning model (Envisioner), a predictive audio world model (Advancer), and a smooth flow-matching policy (Realizer). Training & Evaluation: We introduce OpenX-Sound for scalable pretraining, alongside HEAR-Bench and physical robot deployments for rigorous, sound-causal evaluation.
  • Figure 2: The HEAR framework architecture. The Historizer processes audio packets with a Streaming Stateful Transformer that maintains a compact causal memory $h_{t_k}$ and bridges execution gaps, including the Blind Execution Interval induced by open-loop chunking. The Envisioner employs a hierarchical design. Its high-level omni-modal model integrates multimodal inputs, including vision, instruction, robot state, and audio context, and outputs a semantic latent $z_{t_k}$ and a key-value cache $\mathrm{KV}_{t_k}$. It also predicts a text stage description $\hat{y}^{\text{json}}_{t_k}$. Its low-level model reuses $\mathrm{KV}_{t_k}$ together with the current state and extracts a control feature $u_{t_k}$. The Advancer is a decoder-only transformer that predicts a near-future audio code sequence $\mathbf{z}^a_{t_k\rightarrow t_{k+1}}$ from $z_{t_k}$ during training. The Realizer synthesizes smooth action chunks via Conditional Flow Matching conditioned on $u_{t_k}$.
  • Figure 3: The Historizer module. It processes continuous audio streams into discrete packets and uses a streaming stateful transformer to maintain a persistent causal memory. This mechanism ensures that transient acoustic events occurring during blind execution intervals are preserved for the next decision cycle.
  • Figure 4: The Envisioner module. A hierarchical reasoning architecture fuses multimodal inputs to guide manipulation. The high-level omni-modal model extracts semantic latents and stage descriptions, while the low-level model reuses the resulting key-value cache alongside proprioceptive data to efficiently generate control features.
  • Figure 5: The Advancer module. In scenarios requiring sustained waiting with quasi-static visual observations, this decoder-only audio world model predicts near-future acoustic codes. This predictive objective grounds the shared latent representation in continuous time, helping the policy maintain stability and temporal awareness.
  • ...and 14 more figures
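
As a rough illustration of how a flow-matching policy such as the Realizer could turn noise into an action chunk, the sketch below integrates a velocity field with Euler steps from t = 0 to t = 1. The `toy_velocity` function is a hypothetical closed-form stand-in for a learned, conditioned network. The step count, the flat-list chunk representation, and all names are assumptions rather than details from the paper.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network v(x, t | u).

    For a straight-line path from noise to a known target, the velocity
    at point x and time t is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_chunk(target, num_steps=10, seed=0):
    """Euler-integrate the velocity field from Gaussian noise at t = 0."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # initial noise sample
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Assumed example: a flattened chunk of four action values.
target_chunk = [0.1, -0.2, 0.3, 0.0]
chunk = sample_chunk(target_chunk)
```

In a trained policy the velocity would come from a network conditioned on a control feature rather than from a known target. The toy field is used here only so that the integration loop can be run and checked end to end.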