Learning Actionable Manipulation Recovery via Counterfactual Failure Synthesis

Dayou Li, Jiuzhou Lei, Hao Wang, Lulin Liu, Yunhao Yang, Zihan Wang, Bangya Liu, Minghui Zheng, Zhiwen Fan

Abstract

While recent foundation models have significantly advanced robotic manipulation, these systems still struggle to autonomously recover from execution errors. Current failure-learning paradigms rely on either costly and unsafe real-world data collection or simulator-based perturbations, which introduce a severe sim-to-real gap. Furthermore, existing visual analyzers predominantly output coarse, binary diagnoses rather than the executable, trajectory-level corrections required for actual recovery. To bridge the gap between failure diagnosis and actionable recovery, we introduce Dream2Fix, a framework that synthesizes photorealistic, counterfactual failure rollouts directly from successful real-world demonstrations. By perturbing actions within a generative world model, Dream2Fix creates paired failure-correction data without relying on simulators. To ensure the generated data is physically viable for robot learning, we implement a structured verification mechanism that strictly filters rollouts for task validity, visual coherence, and kinematic safety. This engine produces a high-fidelity dataset of over 120k paired samples. Using this dataset, we fine-tune a vision-language model to jointly predict failure types and precise recovery trajectories, mapping visual anomalies directly to corrective actions. Extensive real-world robotic experiments show our approach achieves state-of-the-art correction accuracy, improving from 19.7% to 81.3% over prior baselines, and successfully enables zero-shot closed-loop failure recovery in physical deployments.

Paper Structure

This paper contains 20 sections, 11 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Dream2Fix is a data generation pipeline that synthesizes large-scale, photorealistic failure rollouts with paired corrections from successful demonstrations and curates them with physical and visual verification.
  • Figure 2: Overview of the Dream2Fix pipeline. Dream2Fix generates diverse failure cases from successful demonstrations via keyframe-level action perturbations, then validates and curates them with physical and visual verifiers. The verified rollouts are auto-labeled into a structured schema used to instruction-tune a VLM that predicts actionable corrections for real-world recovery.
  • Figure 3: Real-world experimental workspace. The setup for real-world evaluation consists of a Franka Research 3 arm with a Franka Hand gripper and a fixed-view Intel RealSense D435i RGB-D camera.
  • Figure 4: Real-world robot execution. For each task, we show the initial failed execution (left) and the corrected execution (right). Across diverse tasks, the correction adjusts the action to recover from the failure under the same real-robot setup.
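
The perturb-then-verify loop described in the abstract and in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the waypoint format, the function names (`perturb_keyframe`, `is_kinematically_safe`, `synthesize_pairs`), and the `max_step` threshold are all hypothetical. It also leaves out the generative world model that renders the failure rollouts, and it replaces the task-validity and visual-coherence verifiers with a single step-size check. What it does show is how one successful demonstration could yield paired failure and correction records.

```python
from math import dist

# Hypothetical action format: a demonstration is a list of (x, y, z)
# end-effector waypoints in metres. The paper's real action space may differ.
Waypoint = tuple[float, float, float]


def perturb_keyframe(actions: list[Waypoint], index: int, offset: Waypoint) -> list[Waypoint]:
    """Return a copy of `actions` with the waypoint at `index` shifted by `offset`."""
    perturbed = list(actions)
    perturbed[index] = tuple(a + o for a, o in zip(actions[index], offset))
    return perturbed


def is_kinematically_safe(actions: list[Waypoint], max_step: float = 0.15) -> bool:
    """Stand-in verifier: reject any rollout with a jump larger than `max_step`."""
    return all(dist(p, q) <= max_step for p, q in zip(actions, actions[1:]))


def synthesize_pairs(demo: list[Waypoint], offsets: list[Waypoint], max_step: float = 0.15) -> list[dict]:
    """Perturb every keyframe with every offset and keep only verified rollouts."""
    pairs = []
    for index in range(len(demo)):
        for offset in offsets:
            failed = perturb_keyframe(demo, index, offset)
            if failed == demo or not is_kinematically_safe(failed, max_step):
                continue  # no actual failure, or filtered out by the verifier
            # The correction is the displacement that maps the failed
            # waypoint back onto the successful demonstration.
            correction = tuple(d - f for d, f in zip(demo[index], failed[index]))
            pairs.append({"keyframe": index, "failed_actions": failed, "correction": correction})
    return pairs


# Usage: one small offset that survives filtering and one large offset that does not.
demo = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]
offsets = [(0.0, 0.02, 0.0), (0.0, 0.5, 0.0)]
pairs = synthesize_pairs(demo, offsets)
```

Deriving each correction from the original demonstration means the successful trajectory itself supplies the label, which is why this kind of pairing needs no extra annotation in the sketch.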