VITAL: Interactive Few-Shot Imitation Learning via Visual Human-in-the-Loop Corrections
Hamidreza Kasaei, Mohammadreza Kasaei
TL;DR
This work tackles the data inefficiency of imitation learning for robotic manipulation by introducing VITAL, a low-cost visual teleoperation and simulation-based data augmentation framework that uses a digital twin and human-in-the-loop (HITL) corrections to scale a handful of demonstrations into large, robust training sets. The approach combines trajectory-level augmentation, hierarchical policy learning, and residual learning with HITL feedback to bridge the sim-to-real gap and improve long-horizon manipulation in both real and simulated environments. Key contributions include a scalable teleoperation interface for data collection, a digital-twin augmentation pipeline that preserves task structure, and a balanced training regimen that combines real demonstrations with augmented simulation data to generalize to new tasks and objects. The results show reduced data-collection costs alongside high task success rates and generalization to novel tasks such as setting a drink tray. Precise real-world execution remains challenging, which motivates future work on closed-loop vision feedback and more dynamic HITL strategies.
Abstract
Imitation Learning (IL) has emerged as a powerful approach in robotics, allowing robots to acquire new skills by mimicking human actions. Despite its potential, data collection for IL remains a significant challenge because of the logistical difficulties and high costs of obtaining high-quality demonstrations. To address these issues, we propose generating large-scale training data from a handful of demonstrations through data augmentation in simulation. Our approach uses affordable hardware and visual processing techniques to collect demonstrations, which are then augmented to create extensive training datasets for imitation learning. By using both real and simulated environments, along with human-in-the-loop corrections, we enhance the generalizability and robustness of the learned policies. We evaluated our method through several rounds of experiments in both simulated and real-robot settings, focusing on tasks of varying complexity, including bottle collecting, stacking objects, and hammering. Our experimental results validate the effectiveness of our approach in learning robust robot policies from simulated data, with performance improving significantly when human-in-the-loop corrections and real-world data are integrated. Additionally, we demonstrate the framework's ability to generalize to new tasks, such as setting a drink tray, showcasing its adaptability and potential for handling a wide range of real-world manipulation tasks. A video of the experiments can be found at: https://youtu.be/YeVAMRqRe64?si=R179xDlEGc7nPu8i
