VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation

Wei Zhao, Pengxiang Ding, Min Zhang, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang

TL;DR

VLAS introduces an end-to-end vision-language-action model that directly ingests speech for robot manipulation, removing the need for external ASR and preserving non-semantic speech information. It integrates a Whisper-based speech encoder, a CLIP-based vision encoder, and a Voice RAG retrieval mechanism within a LLaMA/Vicuna backbone to generate discretized action tokens that control robots. The approach is supported by two new datasets, SQA and CSI, enabling three-stage speech instruction tuning and evaluation of personalized, multimodal interactions. Results on CALVIN and customized benchmarks show strong performance with speech inputs, and real UR5 experiments demonstrate practical applicability. The work also offers VLAS-Base, which extends LLaVA with speech support, and demonstrates that a speech-enabled VLA can achieve personalized task execution without sacrificing multimodal understanding.

Abstract

Vision-language-action models (VLAs) have become increasingly popular in robot manipulation for their end-to-end design and remarkable performance. However, existing VLAs rely heavily on vision-language models (VLMs) that only support text-based instructions, neglecting the more natural speech modality for human-robot interaction. Traditional speech integration methods usually involve a separate speech recognition system, which complicates the model and introduces error propagation. Moreover, the transcription procedure loses non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. To overcome these challenges, we propose VLAS, a novel end-to-end VLA that integrates speech recognition directly into the robot policy model. VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. We also present two new datasets, SQA and CSI, to support a three-stage tuning process for speech instructions, which gives VLAS the ability to interact multimodally across text, image, speech, and robot actions. Taking a step further, a voice retrieval-augmented generation (RAG) paradigm is designed to enable our model to effectively handle tasks that require individual-specific knowledge. Our extensive experiments show that VLAS can effectively accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.

Paper Structure

This paper contains 19 sections, 3 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: For personalization tasks, (a) previous VLAs with text instructions fail, while (b) our VLAS with speech instructions can successfully address them.
  • Figure 2: Overall Framework of VLAS. VLAS encodes visual and speech inputs via encoders and MLP layers to obtain respective embeddings. The Voice RAG module retrieves personalized knowledge based on speaker identification and converts it into embeddings using a text tokenizer. All embeddings are then processed by LLaMA to generate action tokens, which are subsequently detokenized into continuous values to control the robot's movements.
  • Figure 3: Data collection process for the SQA and CSI datasets.
  • Figure 4: Training paradigm of VLAS. The training process of VLAS is divided into three stages. Stage I: Speech Alignment, where the model aligns speech with text through MLP fine-tuning. Stage II: Speech Question Answering, where the model is trained on both speech and visual question answering tasks to facilitate comprehension of multimodal inputs. Stage III: Robot Manipulation Fine-tuning, where the model is further fine-tuned to execute robot manipulation tasks using both speech and text instructions.
  • Figure 5: Demonstration of object ownership tasks (top row) and user preference tasks (bottom row) for customized robot manipulation.
  • ...and 4 more figures
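
The Figure 2 caption mentions that action tokens are detokenized into continuous values to control the robot. The summary does not give the discretization scheme, so the following is only a minimal sketch of one common approach: uniform binning per action dimension. The bin count (256), the normalized action range [-1, 1], the bin-center decoding rule, the 7-dimensional example action, and the function names `tokenize_action` and `detokenize_action` are all illustrative assumptions, not details taken from VLAS.

```python
import numpy as np

# All constants below are assumptions for illustration, not values from the paper.
N_BINS = 256            # assumed number of discrete bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized range of each action dimension


def tokenize_action(action, low=LOW, high=HIGH, n_bins=N_BINS):
    """Map each continuous action dimension to an integer bin index in [0, n_bins - 1]."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    # Scale to [0, n_bins]; truncation equals floor because the values are non-negative.
    scaled = (action - low) / (high - low) * n_bins
    # The upper edge (action == high) would land on n_bins, so cap it at the last bin.
    return np.minimum(scaled.astype(np.int64), n_bins - 1)


def detokenize_action(tokens, low=LOW, high=HIGH, n_bins=N_BINS):
    """Map bin indices back to continuous values, using the center of each bin."""
    tokens = np.asarray(tokens, dtype=np.float64)
    bin_width = (high - low) / n_bins
    return low + (tokens + 0.5) * bin_width


if __name__ == "__main__":
    # Hypothetical 7-dimensional action: position delta, rotation delta, gripper.
    action = np.array([0.12, -0.40, 0.05, 0.0, 0.31, -0.99, 1.0])
    tokens = tokenize_action(action)
    recovered = detokenize_action(tokens)
    print("tokens:   ", tokens)
    print("recovered:", np.round(recovered, 4))
    print("max error:", np.abs(recovered - action).max())
```

Under these assumptions, the round-trip error per dimension is bounded by half a bin width, (HIGH - LOW) / (2 * N_BINS), which is the resolution cost of emitting actions as discrete tokens from a language-model backbone.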