Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation

Xiangyi Wei, Haotian Zhang, Xinyi Cao, Siyu Xie, Weifeng Ge, Yang Li, Changbo Wang

TL;DR

Audio-VLA addresses the limitations of vision-only VLA models by incorporating contact audio to capture dynamic manipulation cues. By fusing visual, audio, and proprioceptive inputs through a LoRA-tuned multi-modal encoder and a large-language-model-based policy (Llama2), it enables richer dynamic understanding and action generation, while the TCR metric provides a systematic measure of ongoing process perception. The approach is validated in audio-augmented LIBERO and RLBench simulations and on real-world tasks, achieving superior performance and demonstrating robustness under domain shifts. The work also provides audio-enhanced simulation environments and an open-source pathway, highlighting the practical impact of multimodal perception for reliable, contact-rich robotic manipulation.

Abstract

Vision-Language-Action (VLA) models have recently achieved significant advances in robotic manipulation. However, relying on vision alone imposes fundamental limitations, particularly in perceiving interaction and the dynamics of manipulation processes. This paper proposes Audio-VLA, a multimodal manipulation policy that uses contact audio to perceive contact events and obtain feedback on dynamic processes, overcoming the vision-only constraint of existing VLA models. Audio-VLA employs pre-trained DINOv2 and SigLIP as visual encoders, AudioCLIP as the audio encoder, and Llama2 as the large language model backbone. We apply LoRA fine-tuning to these pre-trained modules to achieve robust cross-modal understanding of both visual and acoustic inputs. A multimodal projection layer aligns features from the different modalities into a shared feature space. Moreover, the RLBench and LIBERO simulation environments are enhanced with collision-based audio generation to provide realistic sound feedback during object interactions. Because current robotic manipulation evaluations focus on final outcomes rather than systematically assessing dynamic operational processes, this paper also introduces the Task Completion Rate (TCR) metric, which measures how well robots perceive dynamic processes during manipulation and thereby gives a more comprehensive evaluation. Extensive experiments on LIBERO, RLBench, and two real-world tasks demonstrate Audio-VLA's superior performance over vision-only baselines, while the TCR metric effectively quantifies dynamic process perception capabilities.
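
The abstract states that LoRA fine-tuning is applied to the pre-trained modules but gives no configuration. As a rough illustration of the general LoRA idea only, the sketch below adds a trainable low-rank update to a frozen linear layer. The layer sizes, rank, and scaling factor are hypothetical values chosen for readability, not settings reported for Audio-VLA.

```python
import numpy as np


def lora_linear(x, w_frozen, a, b, alpha):
    """Linear layer with a LoRA update: y = x W^T + (alpha / r) * x A^T B^T.

    w_frozen stays fixed; only the low-rank factors a and b would be trained.
    """
    rank = a.shape[0]
    scale = alpha / rank
    return x @ w_frozen.T + scale * (x @ a.T) @ b.T


rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 8, 4  # hypothetical sizes, not from the paper

w_frozen = rng.normal(size=(d_out, d_in))      # pre-trained weight, frozen
a = rng.normal(scale=0.01, size=(rank, d_in))  # trainable down-projection
b = np.zeros((d_out, rank))                    # trainable up-projection, zero-initialised

x = rng.normal(size=(2, d_in))
y = lora_linear(x, w_frozen, a, b, alpha=8.0)

# Number of trainable parameters versus the full weight matrix.
n_lora = a.size + b.size
n_full = w_frozen.size
```

Because `b` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and fine-tuning only has to learn the low-rank correction.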

Paper Structure

This paper contains 20 sections, 13 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Unlike VLA models, Audio-VLA incorporates audio perception, enabling better assessment of contact states and understanding of manipulation dynamics.
  • Figure 2: Architecture of Audio-VLA. The model consists of multimodal encoders (audio, vision, and proprioceptive modules), a multimodal projector that maps heterogeneous features to a unified representation space, a 7B-parameter Llama2 language model as the backbone, and a four-layer MLP action head for continuous action generation.
  • Figure 3: Experimental setup showing the hardware platform and real-world manipulation tasks.
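
The data flow described in the Figure 2 caption can be sketched in a few lines. This is a minimal shape-level illustration under stated assumptions, not the authors' implementation: the feature dimensions, the use of one linear projector per modality, the mean-pooling stand-in for the Llama2 backbone, the hidden width of the head, and the 7-dimensional action are all hypothetical. Only the overall structure (encoder features, projection into a shared space, a backbone, a four-layer MLP action head) comes from the caption.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(d_in, d_out):
    """Return (weight, bias) for a randomly initialised linear layer."""
    return rng.normal(scale=0.02, size=(d_in, d_out)), np.zeros(d_out)


def project(tokens, params):
    """Map one modality's tokens into the shared representation space."""
    w, b = params
    return tokens @ w + b


def mlp_head(h, layers):
    """MLP with ReLU between layers and a linear final layer."""
    for i, (w, b) in enumerate(layers):
        h = h @ w + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)
    return h


D_MODEL = 64    # stand-in for the backbone hidden size (assumption)
ACTION_DIM = 7  # e.g. end-effector pose delta plus gripper (assumption)

# Stand-ins for encoder outputs, shaped (num_tokens, feature_dim).
feats = {
    "vision": rng.normal(size=(16, 48)),
    "audio": rng.normal(size=(4, 32)),
    "proprio": rng.normal(size=(1, 8)),
}

# One projector per modality aligns heterogeneous features to D_MODEL.
projectors = {name: linear(f.shape[1], D_MODEL) for name, f in feats.items()}
tokens = np.concatenate(
    [project(feats[name], projectors[name]) for name in feats], axis=0
)

# Placeholder for the language-model backbone: mean-pool the token sequence.
hidden = tokens.mean(axis=0)

# Four-layer MLP action head producing a continuous action vector.
head = [
    linear(D_MODEL, 128),
    linear(128, 128),
    linear(128, 128),
    linear(128, ACTION_DIM),
]
action = mlp_head(hidden, head)
```

After projection, tokens from every modality share the same width, so they can be concatenated into a single sequence for the backbone regardless of how many tokens each encoder emits.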