Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing

Muhammad Haris Khan; Selamawit Asfaw; Dmitrii Iarchuk; Miguel Altamirano Cabrera; Luis Moreno; Issatay Tokmurziyev; Dzmitry Tsetserukou

TL;DR

Shake-VLA addresses end-to-end bimanual robotic manipulation for automated cocktail preparation by unifying vision, language, and action in a VLA model-based system. The system combines a YOLOv8/EasyOCR visual pipeline, Whisper-1/gTTS speech interfaces, a FAISS/ada-002 + GPT-4o RAG core, an anomaly detector, and a GPT-4o-based language module to generate robotic instructions and manage real-time decision making. Key contributions include a modular, retrieval-driven recipe system, anomaly handling for ingredient availability, and a quantitative evaluation reporting success rates of 93% for speech-to-text, 91% for vision, and 95% for anomaly detection, with a 100% task completion rate under the tested conditions. The work demonstrates a practical pathway for service robotics, enabling natural human-robot collaboration and real-time recipe adaptation in dynamic environments.

Abstract

This paper introduces Shake-VLA, a Vision-Language-Action (VLA) model-based system designed to enable bimanual robotic manipulation for automated cocktail preparation. The system integrates a vision module for detecting ingredient bottles and reading labels, a speech-to-text module for interpreting user commands, and a language model for generating task-specific robotic instructions. Force-torque (FT) sensors measure the quantity of liquid poured, ensuring accurate ingredient proportions during mixing. The system architecture includes a Retrieval-Augmented Generation (RAG) module for accessing and adapting recipes, an anomaly detection mechanism to address ingredient availability issues, and bimanual robotic arms for dexterous manipulation. Experimental evaluations demonstrated a high success rate across system components: the speech-to-text module achieved a 93% success rate in noisy environments, the vision module attained a 91% success rate in object and label detection in cluttered environments, and the anomaly module identified 95% of discrepancies between detected ingredients and recipe requirements. Overall, the system achieved a 100% success rate in preparing cocktails, from recipe formulation to action generation.
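
Two of the mechanisms named in the abstract can be illustrated with a minimal sketch: comparing detected ingredients against recipe requirements, and estimating poured volume from a force reading. This is not the authors' implementation. The function names, the case-insensitive name matching, and the assumption that the FT sensor reports the vertical force in newtons on the held bottle are all hypothetical choices made for illustration.

```python
G = 9.81  # gravitational acceleration in m/s^2


def find_missing(recipe, detected):
    """Return recipe ingredients absent from the detected labels, in recipe order.

    Assumption: labels are matched by case-insensitive, whitespace-trimmed name.
    """
    seen = {name.strip().lower() for name in detected}
    return [item for item in recipe if item.strip().lower() not in seen]


def poured_volume_ml(force_before_n, force_after_n, density_g_per_ml=1.0):
    """Estimate poured volume from the drop in vertical force on the bottle.

    Assumption: the force difference equals the weight of the liquid poured.
    """
    mass_g = (force_before_n - force_after_n) / G * 1000.0
    return mass_g / density_g_per_ml


# Hypothetical recipe and detection result.
missing = find_missing(["Vodka", "Lime juice", "Syrup"], ["vodka ", "SYRUP"])
print(missing)  # ['Lime juice']
```

A non-empty result from `find_missing` corresponds to the kind of discrepancy the anomaly module is described as flagging, at which point a recipe could be adapted or the user informed. The volume estimate ignores sensor noise and dynamic effects during pouring, which a real system would need to filter.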
