SafeHumanoid: VLM-RAG-driven Control of Upper Body Impedance for Humanoid Robot

Yara Mahmoud, Jeffrin Sam, Nguyen Khang, Marcelino Fernando, Issatay Tokmurziyev, Miguel Altamirano Cabrera, Muhammad Haris Khan, Artem Lykov, Dzmitry Tsetserukou

TL;DR

SafeHumanoid presents a VLM–RAG pipeline that translates egocentric vision into context-aware impedance and speed parameters for a humanoid robot. By retrieving validated per-joint gains and a nominal velocity from a curated scenario database and applying them through an IK-based controller, the approach provides a semantic-to-safety bridge that supports safe human-robot collaboration. Experiments on the Unitree G1 show that task success is preserved while safety-aware modulation adapts to human presence and object fragility, though offboard inference latency limits responsiveness in dynamic settings. The work demonstrates a practical path toward standard-compliant, semantics-driven safety in humanoid HRI and outlines concrete avenues for latency reduction and dataset expansion.

Abstract

Safe and trustworthy Human-Robot Interaction (HRI) requires robots not only to complete tasks but also to regulate impedance and speed according to scene context and human proximity. We present SafeHumanoid, an egocentric vision pipeline that links Vision-Language Models (VLMs) with Retrieval-Augmented Generation (RAG) to schedule impedance and velocity parameters for a humanoid robot. Egocentric frames are processed by a structured VLM prompt; the resulting description is embedded and matched against a curated database of validated scenarios, and the retrieved parameters are mapped to joint-level impedance commands via inverse kinematics. We evaluate the system on tabletop manipulation tasks with and without human presence, including wiping, object handovers, and liquid pouring. The results show that the pipeline adapts stiffness, damping, and speed profiles in a context-aware manner, maintaining task success while improving safety. Although current inference latency (up to 1.4 s) limits responsiveness in highly dynamic settings, SafeHumanoid demonstrates that semantic grounding of impedance control is a viable path toward safer, standard-compliant humanoid collaboration.
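
The retrieval step described in the abstract (embed the scene description, then match it against a database of validated scenarios) can be illustrated with a minimal sketch. It rests on several assumptions: the scenario labels, the three-dimensional toy embeddings, and the gain and speed values are all invented for illustration and are not taken from the paper, and a brute-force cosine-similarity search stands in for the FAISS index the system uses. The names `Scenario`, `cosine`, and `retrieve` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One validated database entry (all values here are illustrative)."""
    label: str
    embedding: tuple  # toy embedding of the scenario description
    kp: float         # stiffness gain
    kd: float         # damping gain
    speed: float      # nominal velocity


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, database):
    """Return the scenario whose embedding is most similar to the query."""
    return max(database, key=lambda s: cosine(query, s.embedding))


# Hypothetical database; a real one would hold per-joint gains.
DATABASE = [
    Scenario("wipe table, no human", (1.0, 0.0, 0.0), kp=60.0, kd=2.0, speed=0.30),
    Scenario("handover liquid, human hands present", (0.0, 1.0, 0.9), kp=20.0, kd=4.0, speed=0.10),
    Scenario("handover liquid, no human", (0.0, 1.0, 0.0), kp=40.0, kd=3.0, speed=0.20),
]

# Toy embedding of a VLM description such as "cup of liquid, hands in view".
query = (0.1, 0.9, 0.8)
match = retrieve(query, DATABASE)
print(match.label, match.kp, match.kd, match.speed)
```

Because only entries already in the database can be returned, the controller never receives gains that were not validated beforehand; the trade-off is that scenes far from every stored scenario still map to the nearest one, so a practical system would likely add a similarity threshold and a conservative fallback.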

Paper Structure

This paper contains 27 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Egocentric perception and semantic-to-safety pipeline. Left: robot grasping with and without human presence, showing adaptive modulation of $K_p$, $K_d$, and $v$. Right: high-level flow from camera input through VLM–RAG reasoning to impedance control of G1 upper-body joints.
  • Figure 2: SafeHumanoid pipeline architecture. The onboard PC streams egocentric frames and executes impedance control at 50 Hz, while the offboard workstation (Molmo VLM + FAISS-based RAG) grounds scene semantics into validated impedance and velocity parameters retrieved from a curated scenario database.
  • Figure 3: Example of semantic-to-safety adaptation during a fragile object (liquid) handover. (a,b) Without human presence, the system schedules moderate impedance and speed for stable handling. (c,d) With human hands present, stiffness $K_p$ is reduced and damping $K_d$ is increased to ensure compliant interaction and prevent excessive contact forces.
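
The adaptation described in Figure 3 (lower $K_p$, higher $K_d$ when human hands are present) can be made concrete with a short sketch. It assumes the common joint-space impedance form $\tau = K_p (q_{des} - q) + K_d (\dot{q}_{des} - \dot{q})$, which may differ from the paper's exact control law, and the gain values and joint state are hypothetical. The function name `impedance_torque` is also an assumption.

```python
def impedance_torque(kp, kd, q_des, q, dq_des, dq):
    """Joint torque from a PD-style impedance law (assumed form).

    kp scales the position error (stiffness); kd scales the
    velocity error (damping).
    """
    return kp * (q_des - q) + kd * (dq_des - dq)


# Hypothetical joint state: 0.1 rad short of the target,
# moving toward it at 0.2 rad/s, with a desired velocity of zero.
state = dict(q_des=0.50, q=0.40, dq_des=0.0, dq=0.2)

# Illustrative gain sets for the two cases in Figure 3.
tau_no_human = impedance_torque(kp=40.0, kd=3.0, **state)
tau_human = impedance_torque(kp=20.0, kd=4.0, **state)

print(f"no human: {tau_no_human:.2f} N·m, human present: {tau_human:.2f} N·m")
```

For the same tracking error, the human-present gains command a smaller torque: halving the stiffness halves the restoring term, and the larger damping opposes the joint's motion more strongly. This is the mechanism by which the scheduled gains limit contact forces.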