RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning

Yinpei Dai, Jayjun Lee, Nima Fazeli, Joyce Chai

TL;DR

RACER introduces a scalable data augmentation pipeline that enriches expert demonstrations with recoverable failure trajectories and rich language annotations, paired with a vision-language supervisor and a language-conditioned visuomotor policy. The framework enables online failure analysis and corrective guidance, improving robustness across long-horizon, dynamic-goal, and unseen tasks, with strong sim-to-real transfer demonstrated on RLBench and real Panda experiments. Key contributions include automatic rich language-annotated failure recovery data, a VLM-guided supervisory signal, and empirical evidence that rich language guidance and recovery data outperform state-of-the-art baselines. The work advances practical robotic manipulation by reducing online human intervention and enabling more reliable, adaptable control in both simulated and real-world settings.

Abstract

Developing robust and correctable visuomotor policies for robotic manipulation is challenging due to the lack of self-recovery mechanisms from failures and the limitations of simple language instructions in guiding robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations for training. We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor framework that combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision-language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language-conditioned visuomotor policy that acts as an actor, predicting the next actions. Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer (RVT) on RLBench across various evaluation settings, including standard long-horizon tasks, dynamic goal-change tasks, and zero-shot unseen tasks, achieving superior performance in both simulated and real-world environments. Videos and code are available at: https://rich-language-failure-recovery.github.io.
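The supervisor-actor loop described in the abstract can be illustrated with a toy sketch. Everything below is hypothetical and not taken from the RACER code: `toy_supervisor` is a rule-based stand-in for the VLM, `toy_actor` is a stand-in for the language-conditioned visuomotor policy, and the world is reduced to a 1-D gripper position. The sketch only shows the control flow, in which the supervisor emits an instruction at every step and the actor conditions its next action on it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    gripper_x: float  # toy 1-D end-effector position (assumption)
    target_x: float   # toy 1-D target position (assumption)


def toy_supervisor(goal: str, obs: Observation, tol: float = 0.05) -> str:
    """Rule-based stand-in for the VLM supervisor (hypothetical)."""
    error = obs.target_x - obs.gripper_x
    if abs(error) <= tol:
        return "The gripper is aligned with the target; close the gripper."
    direction = "right" if error > 0 else "left"
    return (f"The gripper is off target; move {direction} by "
            f"{abs(error):.2f} to align with it.")


def toy_actor(goal: str, instruction: str, obs: Observation,
              max_step: float = 0.1) -> float:
    """Stand-in for the language-conditioned policy (hypothetical).

    Returns a 1-D displacement chosen from the instruction text.
    """
    error = obs.target_x - obs.gripper_x
    if "move right" in instruction:
        return min(max_step, error)
    if "move left" in instruction:
        return max(-max_step, error)
    return 0.0  # no motion requested


def run_episode(goal: str, obs: Observation,
                max_steps: int = 20) -> Tuple[Observation, List[str]]:
    """Alternate supervisor guidance and actor actions until no motion is needed."""
    log: List[str] = []
    for _ in range(max_steps):
        instruction = toy_supervisor(goal, obs)  # online language guidance
        log.append(instruction)
        action = toy_actor(goal, instruction, obs)
        if action == 0.0:
            break
        obs = Observation(obs.gripper_x + action, obs.target_x)
    return obs, log


final_obs, instructions = run_episode("pick up the box", Observation(0.0, 0.3))
for line in instructions:
    print(line)
```

In this toy setting the actor has a bounded step size, so the supervisor has to re-issue guidance over several steps before the gripper is aligned, which mirrors the online, step-by-step role the abstract assigns to the VLM.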

Paper Structure (24 sections, 4 figures, 4 tables)

This paper contains 24 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison between the simple and rich language guidance for failure recovery: The robot should approach the Oreo (the blue box on the right) directly to grasp it but instead moved to the wrong object (the black box). To help the visuomotor policy recover from this failure, the rich language instruction provides sufficient details, including a failure analysis (in red), spatial movements (in orange) and the expected outcome (in purple). In contrast, simple language instructions with limited descriptions may not guide the robot effectively, potentially causing it to continue making mistakes.
  • Figure 2: An overview of the automatic rich language-annotated failure-recovery data augmentation pipeline. Given an expert demo (e.g., task goal: close the olive jar), perturbations are injected into expert actions at crucial keyframes (e.g., aligning to, grasping, and releasing a target object) to induce failures. Then, the expert actions are reused as corrections to collect recovery transitions. Finally, all expert and recovery transitions are labelled with rich instructions through GPT-4-turbo. The input for GPT-4-turbo includes the task description, ground-truth object locations, failure types, and heuristic language describing the change in the end-effector's pose at the current step.
  • Figure 3: The RACER framework consists of: (1) the Supervisor, a VLM that monitors the robot's behavior, providing feedback for task execution and error correction with rich instructions; and (2) the Actor, a language-conditioned visuomotor policy that generates actions based on visual observations, proprioceptive states, and language guidance that includes a high-level task goal and an instruction.
  • Figure 4: (a) Comparison of RACER's performance trained with and without failure recovery across three types of instructions. (b) Cross-evaluation of RACER trained with failure recovery, where training and testing were conducted on varying instruction types.
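
The augmentation steps summarized in the Figure 2 caption (perturb an expert action at a keyframe, then reuse the expert action as the correction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: poses are plain (x, y, z) tuples, the perturbation is uniform noise, the axis naming in `describe_move` is an arbitrary convention, and a template string stands in for the GPT-4-turbo labelling step. All names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Pose = Tuple[float, float, float]  # toy (x, y, z) end-effector pose (assumption)


@dataclass
class Transition:
    start: Pose       # pose before the step
    action: Pose      # pose commanded at this step
    kind: str         # "expert" or "recovery"
    instruction: str  # language label for the step


def describe_move(start: Pose, end: Pose) -> str:
    """Heuristic description of the pose change (template stand-in for the LLM)."""
    axes = [("right", "left"), ("forward", "backward"), ("up", "down")]
    parts = []
    for (pos, neg), s, e in zip(axes, start, end):
        delta = e - s
        if abs(delta) > 1e-6:
            parts.append(f"{pos if delta > 0 else neg} by {abs(delta):.2f}")
    return "move " + ", ".join(parts) if parts else "hold position"


def augment(expert_poses: List[Pose], keyframe: int, noise: float = 0.05,
            rng: Optional[random.Random] = None) -> List[Transition]:
    """Keep every expert transition and add one recovery transition at `keyframe`."""
    rng = rng or random.Random(0)
    data: List[Transition] = []
    for i in range(len(expert_poses) - 1):
        start, target = expert_poses[i], expert_poses[i + 1]
        data.append(Transition(start, target, "expert",
                               describe_move(start, target)))
        if i == keyframe:
            # Induce a failure: the robot ends up at a perturbed pose.
            failed = tuple(t + rng.uniform(-noise, noise) for t in target)
            # Reuse the expert action (the original target) as the correction.
            label = ("The gripper missed the intended pose; "
                     + describe_move(failed, target))
            data.append(Transition(failed, target, "recovery", label))
    return data


demo = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.2, 0.0, -0.1)]
dataset = augment(demo, keyframe=1)
for t in dataset:
    print(t.kind, "|", t.instruction)
```

Only expert and recovery transitions are stored here, matching the caption's statement that those are the transitions that get labelled; the perturbed pose appears only as the starting point of the recovery step.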