VITAL: Interactive Few-Shot Imitation Learning via Visual Human-in-the-Loop Corrections

Hamidreza Kasaei, Mohammadreza Kasaei

TL;DR

This work tackles the data inefficiency of imitation learning for robotic manipulation by introducing VITAL, a low-cost visual teleoperation and simulation-based data augmentation framework. VITAL uses a digital twin and human-in-the-loop (HITL) corrections to scale a handful of demonstrations into large, robust training sets. The approach combines trajectory-level augmentation, hierarchical policy learning, and residual learning with HITL feedback to bridge the sim-to-real gap and improve long-horizon manipulation in both real and simulated environments. Key contributions include a scalable teleoperation data-collection interface, a digital-twin augmentation pipeline that preserves task structure, and a balanced training regimen that mixes real demonstrations with augmented simulation data to generalize to new tasks and objects. The results show reduced data collection costs, high task success rates, and generalization to novel tasks such as setting a drink tray. Precise real-world execution remains a challenge, which motivates future work on closed-loop vision feedback and more dynamic HITL strategies.

Abstract

Imitation Learning (IL) has emerged as a powerful approach in robotics, allowing robots to acquire new skills by mimicking human actions. Despite its potential, data collection for IL remains a significant challenge because obtaining high-quality demonstrations is logistically difficult and costly. To address these issues, we propose generating large-scale training data from a handful of demonstrations through data augmentation in simulation. Our approach uses affordable hardware and visual processing techniques to collect demonstrations, which are then augmented to create extensive training datasets for imitation learning. By using both real and simulated environments, along with human-in-the-loop corrections, we enhance the generalizability and robustness of the learned policies. We evaluated our method through several rounds of experiments in both simulated and real-robot settings, focusing on tasks of varying complexity, including bottle collecting, stacking objects, and hammering. Our experimental results validate the effectiveness of our approach in learning robust robot policies from simulated data, which are significantly improved by human-in-the-loop corrections and real-world data integration. Additionally, we demonstrate the framework's capability to generalize to new tasks, such as setting a drink tray, showcasing its adaptability and potential for handling a wide range of real-world manipulation tasks. A video of the experiments can be found at: https://youtu.be/YeVAMRqRe64?si=R179xDlEGc7nPu8i
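
To make the idea of learning from human-in-the-loop corrections more concrete, the sketch below shows one very simple form of residual learning: a base policy is left unchanged, and a correction term is estimated from the gap between what the base policy proposed and what a human corrected it to. This is an illustration under stated assumptions, not the method used in the paper. The names `base_policy`, `fit_constant_residual`, and `corrected_policy` are hypothetical, the base policy is a toy linear map, and the residual is a constant offset rather than a learned network.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Policy = Callable[[Sequence[float]], Vector]


def base_policy(observation: Sequence[float]) -> Vector:
    """Hypothetical stand-in for a policy trained on augmented simulation data.

    Here it simply scales the observation by a fixed gain of 0.5.
    """
    return [0.5 * x for x in observation]


def fit_constant_residual(
    observations: Sequence[Sequence[float]],
    corrected_actions: Sequence[Sequence[float]],
    policy: Policy,
) -> Vector:
    """Estimate a constant residual as the mean gap between the
    human-corrected actions and the base policy's own actions.

    Assumption: a constant offset is enough; a real system would more
    likely fit an observation-dependent residual model.
    """
    if not observations or len(observations) != len(corrected_actions):
        raise ValueError("need one corrected action per observation")
    dim = len(corrected_actions[0])
    totals = [0.0] * dim
    for obs, corrected in zip(observations, corrected_actions):
        proposed = policy(obs)
        for i in range(dim):
            totals[i] += corrected[i] - proposed[i]
    return [t / len(observations) for t in totals]


def corrected_policy(
    observation: Sequence[float], policy: Policy, residual: Sequence[float]
) -> Vector:
    """Final action = base policy action + residual correction."""
    return [a + r for a, r in zip(policy(observation), residual)]


if __name__ == "__main__":
    # Two made-up observations and the actions a human corrected them to.
    obs_batch = [[1.0, 2.0], [3.0, 4.0]]
    human_actions = [[0.6, 0.8], [1.6, 1.8]]
    residual = fit_constant_residual(obs_batch, human_actions, base_policy)
    print("residual:", residual)
    print("action:", corrected_policy([2.0, 2.0], base_policy, residual))
```

The design choice worth noting is that the base policy is never modified: corrections only shape the additive term, so a small number of human interventions can adjust behavior without retraining on the full augmented dataset.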

Paper Structure (25 sections, 17 figures, 3 tables)

Figures (17)

  • Figure 1: Overview of the proposed low-cost teleoperation interface.
  • Figure 2: Overview of the residual learning based on human-in-the-loop feedback.
  • Figure 3: Three long-horizon evaluation tasks are used. (left) Bottle Collecting; (center) Stacking Pringles; (right) Hammering.
  • Figure 4: (left) Our real dual-arm robot setup; (right) Output of our perception including the bounding box, pose, and label of the objects.
  • Figure 5: Results of Q1 evaluation.
  • ...and 12 more figures