Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation

Chang Nie; Tianchen Deng; Guangming Wang; Zhe Liu; Hesheng Wang

Abstract

While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low-frequency updates or system latency. This problem is exacerbated by action chunking with open-loop execution, which creates a Blind Execution Interval where acoustic events are lost between discrete audio observation windows. Recognizing the necessity of continuous auditory awareness, we formalize Vision-Sound-Language-Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we introduce HEAR, a VSLA framework integrating four components: (i) a streaming Historizer to maintain a compact, causal audio context across execution gaps; (ii) an Envisioner adapted from omni foundation models to reason over multi-sensory inputs; (iii) an Advancer, formulated as an audio world model, to learn temporal dynamics by predicting near-future audio codes; and (iv) a flow-matching Realizer policy to generate smooth action chunks. To address the scarcity of pretraining data and evaluations for VSLA, we construct OpenX-Sound for pretraining, alongside HEAR-Bench, the first sound-centric manipulation benchmark with strict causal timing rules. Our results suggest that robust sound-centric manipulation necessitates causal persistence and explicit temporal learning. This framework provides a practical step toward multi-sensory foundation models for embodied agents, enabling robots to perceive and interact with dynamic environments. Code and videos are available at https://hear.irmv.top.
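
The Blind Execution Interval described above can be illustrated with a toy timing model. The following is a minimal sketch, not the paper's implementation: it assumes decisions occur every `chunk_len` control steps, that a windowed observer hears only the last `window_len` steps before each decision, and that a sound lasts a single step. The names `LoopConfig`, `heard_by_window`, `StreamingMemory`, `missed_fraction`, and `replay` are hypothetical, and the running-peak memory is a deliberately simple stand-in for a learned causal audio memory.

```python
from dataclasses import dataclass


@dataclass
class LoopConfig:
    chunk_len: int   # control steps executed open-loop per decision (assumed)
    window_len: int  # audio steps visible in the window ending at a decision (assumed)


def heard_by_window(event_step: int, cfg: LoopConfig) -> bool:
    """True if a one-step sound at event_step falls inside the audio window
    that ends at the next decision step."""
    next_decision = (event_step // cfg.chunk_len + 1) * cfg.chunk_len
    return next_decision - event_step <= cfg.window_len


class StreamingMemory:
    """Toy causal memory: folds every audio packet into a running peak.
    A stand-in for a learned streaming memory, not the Historizer itself."""

    def __init__(self) -> None:
        self.peak = 0.0

    def push(self, energy: float) -> None:
        self.peak = max(self.peak, energy)

    def read_and_reset(self) -> float:
        peak, self.peak = self.peak, 0.0
        return peak


def missed_fraction(cfg: LoopConfig, horizon: int) -> float:
    """Fraction of possible event times in [0, horizon) that the windowed
    observer never hears."""
    missed = sum(not heard_by_window(e, cfg) for e in range(horizon))
    return missed / horizon


def replay(event_step: int, cfg: LoopConfig) -> tuple:
    """Replay the chunk containing a unit-energy click at event_step.
    Returns (windowed_heard, streaming_heard) at the next decision."""
    start = (event_step // cfg.chunk_len) * cfg.chunk_len
    memory = StreamingMemory()
    for step in range(start, start + cfg.chunk_len):
        memory.push(1.0 if step == event_step else 0.0)
    return heard_by_window(event_step, cfg), memory.read_and_reset() > 0.0


if __name__ == "__main__":
    cfg = LoopConfig(chunk_len=10, window_len=4)
    print(missed_fraction(cfg, 100))  # 0.6
    print(replay(3, cfg))             # (False, True)
```

Under these assumptions, a 10-step chunk with a 4-step window leaves 6 of every 10 steps unobserved, whereas a memory that ingests every packet carries the click forward to the next decision regardless of when it occurred.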

Paper Structure (87 sections, 23 equations, 19 figures, 8 tables)

Introduction
Related Work
Vision-Language-Action Policies
Audio in Robotics
Datasets and Benchmarks
The VSLA Paradigm
VSLA Observation and Action Space
Environment and Signals
Observation Interface
Action Space
Temporal Mismatch and the Blind Execution Interval
Persistence Mismatch Between Vision and Sound
Decision Cadence and Latency
Action Chunking and the Blind Execution Interval
Sound Causality and Evidence Vanishing
...and 72 more sections

Figures (19)

Figure 1: The HEAR framework for Vision-Sound-Language-Action (VSLA) manipulation. Challenge: Upgrading standard VLA to the VSLA paradigm. We highlight a critical challenge: VLA models frequently miss transient acoustic cues due to the Blind Execution Interval (BEI), a structural blind spot caused by system latency and open-loop action chunking. Scenarios: Everyday sound-centric manipulation tasks require robots to perceive diverse acoustic cues, including speech, trigger, continuous, and interactive sounds. Solution: To address the BEI and handle complex sounds, the HEAR framework integrates four components: a causal audio memory (Historizer), an omni-sensory reasoning model (Envisioner), a predictive audio world model (Advancer), and a smooth flow-matching policy (Realizer). Training & Evaluation: We introduce OpenX-Sound for scalable pretraining, alongside HEAR-Bench and physical robot deployments for rigorous, sound-causal evaluation.
Figure 2: The HEAR framework architecture. The Historizer processes audio packets with a Streaming Stateful Transformer that maintains a compact causal memory $h_{t_k}$ and bridges execution gaps, including the Blind Execution Interval induced by open-loop chunking. The Envisioner employs a hierarchical design. Its high-level omni-modal model integrates multimodal inputs, including vision, instruction, robot state, and audio context, and outputs a semantic latent $z_{t_k}$ and a key-value cache $\mathrm{KV}_{t_k}$. It also predicts a text stage description $\hat{y}^{\text{json}}_{t_k}$. Its low-level model reuses $\mathrm{KV}_{t_k}$ together with the current state and extracts a control feature $u_{t_k}$. The Advancer is a decoder-only transformer that predicts a near-future audio code sequence $\mathbf{z}^a_{t_k\rightarrow t_{k+1}}$ from $z_{t_k}$ during training. The Realizer synthesizes smooth action chunks via Conditional Flow Matching conditioned on $u_{t_k}$.
Figure 3: The Historizer module. It processes continuous audio streams into discrete packets and uses a streaming stateful transformer to maintain a persistent causal memory. This mechanism ensures that transient acoustic events occurring during blind execution intervals are preserved for the next decision cycle.
Figure 4: The Envisioner module. A hierarchical reasoning architecture fuses multimodal inputs to guide manipulation. The high-level omni-modal model extracts semantic latents and stage descriptions, while the low-level model reuses the resulting key-value cache alongside proprioceptive data to efficiently generate control features.
Figure 5: The Advancer module. In scenarios requiring sustained waiting with quasi-static visual observations, this decoder-only audio world model predicts near-future acoustic codes. This predictive objective grounds the shared latent representation in continuous time, helping the policy maintain stability and temporal awareness.
...and 14 more figures
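
The Figure 2 caption states that the Realizer synthesizes action chunks via Conditional Flow Matching conditioned on $u_{t_k}$. As a rough illustration of what sampling from such a policy can look like, the sketch below Euler-integrates a velocity field from noise at $t=0$ to an action vector at $t=1$. It assumes the common straight-line probability path, which this page does not confirm for HEAR. The names `sample_action_chunk` and `toy_velocity` are hypothetical, and `toy_velocity` is a hand-written stand-in for a learned network in which the conditioning vector is simply the target itself.

```python
from typing import Callable, List, Sequence

Vector = List[float]
VelocityFn = Callable[[Sequence[float], float, Sequence[float]], Vector]


def sample_action_chunk(
    velocity: VelocityFn,
    noise: Sequence[float],
    cond: Sequence[float],
    num_steps: int = 10,
) -> Vector:
    """Euler-integrate dx/dt = velocity(x, t, cond) from t=0 (noise) to t=1."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays strictly below 1, so 1 - t never hits zero
        v = velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Sequence[float], t: float, cond: Sequence[float]) -> Vector:
    """Stand-in for a learned velocity network (assumption): the exact
    conditional velocity of the straight path that ends at `cond`."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


if __name__ == "__main__":
    start = [0.3, -1.2, 0.5]    # pretend Gaussian noise sample
    target = [1.0, 0.0, -1.0]   # pretend flattened action chunk
    print(sample_action_chunk(toy_velocity, start, target, num_steps=8))
```

With this toy field the integrator lands on the target up to floating-point rounding, which makes it a convenient sanity check for the integration loop before a trained network is substituted for `toy_velocity`.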
