Hierarchical Audio-Visual-Proprioceptive Fusion for Precise Robotic Manipulation

Siyuan Li; Jiani Lu; Yu Song; Xianren Li; Bo An; Peng Liu

TL;DR

The paper addresses precise robotic manipulation under partial observability with a hierarchical, audio-centric fusion of audio, vision, and proprioception. It introduces a binary-branched fusion module and an interaction modeling module that capture higher-order cross-modal dependencies conditioned on audio, and couples them with a diffusion-based policy for continuous action generation. Real-world experiments on liquid pouring and cabinet opening show substantial performance gains over baselines, along with improved robustness and generalization. A mutual information analysis supports the informative role of audio cues. The findings highlight the practical value of sparse acoustic signals in multimodal robotic perception and control, especially for contact-rich interactions.

Abstract

Existing robotic manipulation methods primarily rely on visual and proprioceptive observations, which may struggle to infer contact-related interaction states in partially observable real-world environments. Acoustic cues, by contrast, naturally encode rich interaction dynamics during contact, yet remain underexploited in current multimodal fusion literature. Most multimodal fusion approaches implicitly assume homogeneous roles across modalities, and thus design flat and symmetric fusion structures. However, this assumption is ill-suited for acoustic signals, which are inherently sparse and contact-driven. To achieve precise robotic manipulation through acoustic-informed perception, we propose a hierarchical representation fusion framework that progressively integrates audio, vision, and proprioception. Our approach first conditions visual and proprioceptive representations on acoustic cues, and then explicitly models higher-order cross-modal interactions to capture complementary dependencies among modalities. The fused representation is leveraged by a diffusion-based policy to directly generate continuous robot actions from multimodal observations. The combination of end-to-end learning and a hierarchical fusion structure enables the policy to exploit task-relevant acoustic information while mitigating interference from less informative modalities. The proposed method is evaluated on real-world robotic manipulation tasks, including liquid pouring and cabinet opening. Extensive experimental results demonstrate that our approach consistently outperforms state-of-the-art multimodal fusion frameworks, particularly in scenarios where acoustic cues provide task-relevant information not readily available from visual observations alone. Furthermore, a mutual information analysis is conducted to interpret the effect of audio cues in robotic manipulation via multimodal fusion.
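
The two-stage structure described in the abstract (condition vision and proprioception on audio, then model cross-modal interactions) can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration: the FiLM-style scale-and-shift conditioning, the elementwise product used as the interaction term, the feature sizes `D` and `Z_DIM`, and all function names are hypothetical stand-ins, not the paper's Binary-Branched Fusion Module or Interaction Modeling Module.

```python
import numpy as np

D = 16       # hypothetical per-modality feature size
Z_DIM = 32   # hypothetical size of the fused embedding z_t

def init_params(seed=0):
    """Randomly initialise the illustrative projection weights."""
    rng = np.random.default_rng(seed)
    w = lambda i, o: rng.normal(0.0, 0.1, size=(i, o))
    return {
        "v_scale": w(D, D), "v_shift": w(D, D),
        "p_scale": w(D, D), "p_shift": w(D, D),
        "out": w(3 * D, Z_DIM),
    }

def condition_on_audio(feat, audio, w_scale, w_shift):
    """Scale and shift a feature using audio (FiLM-style; an assumption)."""
    return feat * (1.0 + audio @ w_scale) + audio @ w_shift

def hierarchical_fuse(audio, vision, proprio, params):
    # Stage 1: two branches, each conditioned on the audio feature.
    v_a = condition_on_audio(vision, audio, params["v_scale"], params["v_shift"])
    p_a = condition_on_audio(proprio, audio, params["p_scale"], params["p_shift"])
    # Stage 2: an elementwise product stands in for higher-order interactions.
    interaction = v_a * p_a
    fused = np.concatenate([v_a, p_a, interaction], axis=-1)
    return fused @ params["out"]

params = init_params()
rng = np.random.default_rng(1)
audio, vision, proprio = (rng.normal(size=D) for _ in range(3))
z_t = hierarchical_fuse(audio, vision, proprio, params)  # would condition the policy
```

In this sketch, an all-zero audio feature leaves the visual and proprioceptive branches unchanged, which is one simple way to keep a sparse modality from disturbing the others when it carries no signal. Whether the paper's modules behave this way is not stated in this section.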

Paper Structure (21 sections, 15 equations, 6 figures, 4 tables)

Introduction
Related Work
Visual and Audio Representations
Audio-Visual Fusion
Problem Formulation and Preliminaries
Problem Formulation
Diffusion Policy
Method
Separate Feature Encoder
Binary-Branched Fusion Module
Interaction Modeling Module
Experiments
Setup
Baselines and Metrics
Baselines
...and 6 more sections

Figures (6)

Figure 1: Example trajectory of visual, audio, and proprioceptive observations in a real-world pouring task. The three modalities are temporally synchronized, yet exhibit markedly different characteristics. Visual and proprioceptive signals vary smoothly over time, reflecting gradual motion and pose changes, whereas audio signals are sparse and exhibit abrupt transients tightly coupled with physical interactions. These empirical observations suggest that acoustic information encodes interaction-specific cues that are complementary to vision, and that the three modalities are inherently heterogeneous.
Figure 2: The architecture of the proposed hierarchical fusion method. Each modality is initially encoded, followed by the Binary-Branched Fusion Module, which aggregates the extracted features. The resulting intermediate features then undergo interactive fusion to yield the final embedding $z_t$, which serves as the conditioning input to the diffusion policy. This policy iteratively denoises the trajectory from random noise to executable robot actions.
Figure 3: Overview of the experimental environment. Both experiments involve visual, acoustic, and proprioceptive sensing.
Figure 4: Measurement of the liquid level. We measure the height of the liquid after the robot arm has poured. The air column height is then computed as the container height minus the measured liquid height.
Figure 5: Measurement in the cabinet opening task. The left panel shows the measurement of the remaining sliding distance of the cabinet door. The right panel is a top view of the cabinet on the desk. The coordinates of two reference points on the cabinet base diameter are recorded before (green) and after (orange) the gripper pulls the door, enabling computation of the cabinet’s translational displacement and rotational change.
...and 1 more figure
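
The measurements described in the Figure 4 and Figure 5 captions reduce to short arithmetic, sketched below. The air column formula is taken directly from the Figure 4 caption. For the cabinet, the captions do not give the exact formulas, so this sketch assumes that translation is the displacement of the midpoint between the two reference points and that rotation is the change in direction of the segment joining them. The function names and the numeric values are hypothetical.

```python
import math

def air_column_height(container_height, liquid_height):
    """Air column height = container height minus measured liquid height."""
    return container_height - liquid_height

def cabinet_motion(before, after):
    """Return (translation, rotation in degrees) from two reference points.

    `before` and `after` are each a pair of (x, y) points in a top view.
    Translation is measured at the midpoint of the pair (an assumption).
    """
    (b1, b2), (a1, a2) = before, after
    mid_b = ((b1[0] + b2[0]) / 2, (b1[1] + b2[1]) / 2)
    mid_a = ((a1[0] + a2[0]) / 2, (a1[1] + a2[1]) / 2)
    translation = math.hypot(mid_a[0] - mid_b[0], mid_a[1] - mid_b[1])
    # Direction of the segment joining the two reference points.
    ang_b = math.atan2(b2[1] - b1[1], b2[0] - b1[0])
    ang_a = math.atan2(a2[1] - a1[1], a2[0] - a1[0])
    delta = ang_a - ang_b
    # Wrap the angle difference into (-pi, pi].
    delta = math.atan2(math.sin(delta), math.cos(delta))
    return translation, math.degrees(delta)

# Hypothetical readings, in arbitrary but consistent length units.
air = air_column_height(12.0, 7.5)
shift, turn = cabinet_motion(before=((0.0, 0.0), (2.0, 0.0)),
                             after=((1.0, 1.0), (1.0, 3.0)))
```

Using the midpoint makes the translation estimate independent of which reference point is listed first, and wrapping the angle difference avoids spurious jumps of a full turn when the segment direction crosses the ±180° boundary.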
