The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning

Jiyu Lim, Youngwoo Yoon, Kwanghyun Park

Abstract

Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework in which a robot critiques and replans its own actions by leveraging a Vision-Language Model (VLM) as a "human-like social critic." CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot's description file (e.g., MJCF), (2) generation of step-by-step behavior plans based on situational context, (3) generation of low-level joint control code by referencing visual information (joint range-of-motion visualizations), (4) VLM-based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward-based search. This approach is not tied to a specific robot API; it can generate subtly different, human-like motions on various platforms using only the robot's structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, our proposed method achieved significantly higher preference and situational appropriateness ratings compared to previous methods. This research presents a general framework that minimizes human intervention while expanding the robot's autonomous interaction capabilities and cross-platform applicability. Detailed result videos and supplementary information regarding this work are available at: https://limjiyu99.github.io/inner-critic/
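
As an illustration of step (1), the sketch below shows one way movable joints and their limits might be read from an MJCF file using only Python's standard library. It is not the authors' implementation: the function name `extract_movable_joints`, the returned dictionary layout, and the toy model `EXAMPLE_MJCF` are assumptions made for this example. It also ignores MJCF features such as `<default>` classes, `<freejoint>` elements, and the `<compiler angle="...">` setting, all of which a real parser would need to handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy model, written only for this example.
EXAMPLE_MJCF = """
<mujoco model="toy_arm">
  <worldbody>
    <body name="base">
      <body name="upper_arm">
        <joint name="shoulder" type="hinge" range="-1.57 1.57"/>
        <body name="hand">
          <joint name="wrist" range="-0.5 0.5"/>
          <joint name="roll" type="hinge"/>
        </body>
      </body>
    </body>
  </worldbody>
</mujoco>
"""


def extract_movable_joints(mjcf_text):
    """Return hinge and slide joints with their parent body and range.

    A joint without an explicit type is treated as a hinge (the MJCF
    default). A joint without a `range` attribute gets `range=None`.
    """
    root = ET.fromstring(mjcf_text)
    joints = []
    for body in root.iter("body"):
        # findall("joint") only matches direct children, so each joint
        # is reported once, under the body that owns it.
        for joint in body.findall("joint"):
            joint_type = joint.get("type", "hinge")
            if joint_type not in ("hinge", "slide"):
                continue  # skip free and ball joints in this sketch
            range_attr = joint.get("range")
            limits = None
            if range_attr is not None:
                limits = tuple(float(v) for v in range_attr.split())
            joints.append({
                "name": joint.get("name"),
                "body": body.get("name"),
                "type": joint_type,
                "range": limits,
            })
    return joints


if __name__ == "__main__":
    for j in extract_movable_joints(EXAMPLE_MJCF):
        print(j)
```

A list like this could then be passed to the planning and code-generation stages as a compact description of what the robot can move, though how CRISP actually formats that information is not stated here.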

Paper Structure (23 sections, 5 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Overview of the proposed framework. (a) The robot generates socially appropriate responses to human actions. For example, if a person waves to greet, the robot waves back. If a person dances joyfully, the robot generates an action of clapping and cheering. (b) Given a robot's structural file, the system produces low-level joint control code, which is refined through iterative VLM evaluation for natural and context-appropriate behavior.
  • Figure 2: Overview of the proposed social behavior generation framework. The framework parses robot morphology, plans social behaviors, generates low-level joint commands, and uses a VLM critic to iteratively evaluate and replan actions.
  • Figure 3: Visualization of robot joint range of motion. Full-body and zoomed-in images for positive, zero, and negative values of the Everyday robot's wrist joint.
  • Figure 4: Example of keyframe capture for continuous movement. Keyframes are captured where the angular velocity is zero. The graph below shows the change in wrist angle over time, with red dots indicating the captured moments.
  • Figure 5: Types of robots used in the experiment: Unitree G1, Stretch 3 (left), TIAGo (top right), Open Mini Duck (bottom right). The Everyday robot is shown in Fig. 3. The proposed method can generate social behaviors for a wide range of robots, from simple ones to humanoids.
  • ...and 4 more figures
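
The Figure 4 caption describes capturing keyframes where a joint's angular velocity is zero. A minimal sketch of that idea on a sampled joint-angle trace follows. It is an assumption-laden illustration rather than the paper's procedure: the function name `zero_velocity_keyframes`, the finite-difference velocity estimate, and the tolerance `eps` are all hypothetical choices.

```python
def zero_velocity_keyframes(angles, dt, eps=1e-9):
    """Return sample indices where the joint momentarily stops.

    Velocity is estimated by forward differences. Sample i is reported
    when the velocity into it is nonzero and the velocity out of it is
    either (near) zero or of opposite sign, i.e. a turning point or the
    start of a pause. The first and last samples are never reported.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    velocities = [(angles[i + 1] - angles[i]) / dt
                  for i in range(len(angles) - 1)]
    keyframes = []
    for i in range(1, len(velocities)):
        v_in, v_out = velocities[i - 1], velocities[i]
        if abs(v_in) <= eps:
            continue  # already at rest, not a new stopping point
        if abs(v_out) <= eps or v_in * v_out < 0:
            keyframes.append(i)
    return keyframes


if __name__ == "__main__":
    # Triangle-wave wrist motion: up, down, up again.
    wrist = [0, 1, 2, 1, 0, 1, 2]
    print(zero_velocity_keyframes(wrist, dt=1.0))
```

For the triangle-wave trace above, the reported indices are the two reversal points (samples 2 and 4). Noisy real trajectories would likely need smoothing or a larger `eps` before this test is reliable.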