iTeach: Interactive Teaching for Robot Perception using Mixed Reality
Jishnu Jaykumar P, Cole Salvato, Vinaya Bomnale, Jikai Wang, Yu Xiang
TL;DR
iTeach addresses the challenge of robust robot perception in open-world environments by enabling real-time, human-in-the-loop refinement via a mixed-reality interface. The system collects lightweight annotations during deployment using eye gaze and voice, propagates sparse labels through short RGB-D videos with SAM2, and fine-tunes on the fly to produce updated perception models. The approach is instantiated on Unseen Object Instance Segmentation (UOIS) with MSMFormer as the baseline. It shows consistent improvements on both a collected real-world dataset and the SceneReplica benchmark, and these gains translate into better grasping performance in real-world manipulation with a Fetch robot. This work demonstrates a practical, in-situ pathway to robust, generalizable perception by tightly coupling human guidance, efficient annotation, and rapid model updates, with potential extension to manipulation skills.
Abstract
Robots deployed in the wild often encounter objects and scenes that break pre-trained perception models, yet adapting these models typically requires slow offline data collection, labeling, and retraining. We introduce iTeach, a human-in-the-loop system that enables robots to improve perception continuously as they explore new environments. A human sees the robot's predictions from its own viewpoint and corrects failures in real time, and the corrected data drives iterative fine-tuning until performance is satisfactory. A mixed reality headset provides the interface, overlaying predictions in the user's view and enabling lightweight annotation via eye gaze and voice. Instead of tedious frame-by-frame labeling, a human guides the robot to scenes of choice and records short videos while interacting with objects. The human labels only the final frame, and a video segmentation model propagates labels across the sequence, converting seconds of input into dense supervision. The refined model is deployed immediately, closing the loop between human feedback and robot learning. We demonstrate iTeach on Unseen Object Instance Segmentation (UOIS), achieving consistent improvements over a pre-trained MSMFormer baseline on both our collected dataset and the SceneReplica benchmark, where it also leads to higher grasping success. We further present a real-world demonstration of grasping unseen objects with a Fetch robot. By combining human judgment, efficient annotation, and on-the-fly refinement, iTeach provides a practical path toward perception systems that generalize robustly in diverse real-world conditions. Project page: https://irvlutd.github.io/iTeach
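
As a rough illustration of the final-frame labeling idea, the sketch below shows one plausible way to turn a single labeled frame into a mask for every frame of a short video: process the frames in reverse so that the labeled frame comes first, then carry the mask along one step at a time. The names `propagate_from_final_frame` and `shift_tracker` are hypothetical, and `track_step` is a placeholder for a video segmentation model. It is not the SAM2 API and is not taken from the iTeach implementation.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Mask = TypeVar("Mask")


def propagate_from_final_frame(
    frames: Sequence[Frame],
    final_mask: Mask,
    track_step: Callable[[Frame, Mask, Frame], Mask],
) -> List[Mask]:
    """Return one mask per frame, given a label for the last frame only.

    track_step(prev_frame, prev_mask, next_frame) is a hypothetical stand-in
    for a video segmentation model that moves a mask from one frame to the
    next frame it is shown.
    """
    if not frames:
        return []
    # Walk the video backwards so the labeled final frame comes first.
    reversed_frames = list(reversed(frames))
    masks = [final_mask]
    for prev_frame, next_frame in zip(reversed_frames, reversed_frames[1:]):
        masks.append(track_step(prev_frame, masks[-1], next_frame))
    # Restore chronological order so masks[i] corresponds to frames[i].
    masks.reverse()
    return masks


def shift_tracker(prev_frame: int, prev_mask: List[int], next_frame: int) -> List[int]:
    """Toy tracker: a frame is the object's column offset, a mask is a list
    of occupied columns, and tracking shifts the mask by the offset change."""
    return [col + (next_frame - prev_frame) for col in prev_mask]


# Toy video: the object sits at offsets 0, 1, 3; only the last frame is labeled.
dense_masks = propagate_from_final_frame([0, 1, 3], [3, 4], shift_tracker)
print(dense_masks)  # [[0, 1], [1, 2], [3, 4]]
```

In this toy setup a single label on the last frame yields a mask for each of the three frames, which mirrors how a few seconds of human input can become dense supervision for fine-tuning. A real system would replace `shift_tracker` with the propagation routine of the chosen video segmentation model.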
