iTeach: Interactive Teaching for Robot Perception using Mixed Reality

Jishnu Jaykumar P, Cole Salvato, Vinaya Bomnale, Jikai Wang, Yu Xiang

TL;DR

iTeach addresses the challenge of robust robot perception in open-world environments by enabling real-time, human-in-the-loop refinement via a mixed-reality interface. The system collects lightweight annotations during deployment using eye gaze and voice, propagates sparse labels through short RGB-D videos with SAM2, and performs on-the-fly fine-tuning to produce updated perception models. The approach is instantiated on Unseen Object Instance Segmentation (UOIS) with MSMFormer as a baseline, showing consistent improvements on both a real dataset and the SceneReplica benchmark, and translating to better grasping performance in real-world manipulation with a Fetch robot. This work demonstrates a practical, in-situ pathway to robust, generalizable perception by tightly coupling human guidance, efficient annotation, and rapid model updates, with potential extension to manipulation skills.

Abstract

Robots deployed in the wild often encounter objects and scenes that break pre-trained perception models, yet adapting these models typically requires slow offline data collection, labeling, and retraining. We introduce iTeach, a human-in-the-loop system that enables robots to improve perception continuously as they explore new environments. A human sees the robot's predictions from the robot's viewpoint and corrects failures in real time, and the corrected data drives iterative fine-tuning until performance is satisfactory. A mixed reality headset provides the interface, overlaying predictions in the user's view and enabling lightweight annotation via eye gaze and voice. Instead of tedious frame-by-frame labeling, a human guides the robot to scenes of choice and records short videos while interacting with objects. The human labels only the final frame, and a video segmentation model propagates labels across the sequence, converting seconds of input into dense supervision. The refined model is deployed immediately, closing the loop between human feedback and robot learning. We demonstrate iTeach on Unseen Object Instance Segmentation (UOIS), achieving consistent improvements over a pre-trained MSMFormer baseline on both our collected dataset and the SceneReplica benchmark, where it leads to higher grasping success, followed by a real-world demonstration of grasping unseen objects with a Fetch robot. By combining human judgment, efficient annotation, and on-the-fly refinement, iTeach provides a practical path toward perception systems that generalize robustly in diverse real-world conditions. Project page: https://irvlutd.github.io/iTeach

Paper Structure

This paper contains 15 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of our iTeach System. (a) A user wearing a Microsoft HoloLens 2 headset can see the object segmentation output from a robot in real time to inspect failures of the segmentation model. (b) The user can interact with objects and annotate images using the MR device. (c) Once a labeled dataset is obtained, our system fine-tunes the perception model with these labeled images. The system continually improves the perception model by collecting annotations of failed examples.
  • Figure 2: System architecture. HoloLens, Robot, and PC are integrated for interactive teaching and data collection.
  • Figure 3: Task-space setup for representative perception tasks. The arrow indicates human-aided navigation to reach informative task spaces. When false or uncertain predictions are observed, the human adjusts the robot viewpoint to obtain improved observations for data collection.
  • Figure 4: Point-based annotation using eye gaze and voice commands. The orange cursor denotes the user’s eye-gaze location, while green dots indicate registered point prompts.
  • Figure 5: Mask propagation from the final annotated frame to earlier frames using bounding-box prompts derived from MR point annotations. Human-guided interaction progressively transforms cluttered scenes into cleaner configurations, enabling robust mask propagation.
  • ...and 3 more figures
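
The Figure 5 caption describes propagating masks from the final annotated frame back to earlier frames using bounding-box prompts. The sketch below illustrates two small pieces of such a pipeline under stated assumptions: it takes a binary mask for one object on the final frame, computes its tight bounding box, and reverses the frame order so that the annotated frame comes first. How iTeach actually derives its box prompts is not specified in this summary, and the helper names `mask_to_box` and `reverse_for_propagation` are hypothetical. The call to the video segmentation model itself is intentionally left out.

```python
from typing import List, Optional, Tuple

import numpy as np


def mask_to_box(mask: np.ndarray) -> Optional[Tuple[int, int, int, int]]:
    """Return the tight (x_min, y_min, x_max, y_max) box around a binary mask.

    Hypothetical helper: assumes a per-object mask on the final frame is
    already available (for example, obtained from the point prompts).
    Returns None when the mask is empty.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def reverse_for_propagation(frames: List[str]) -> List[str]:
    """Reorder frames so the annotated final frame is processed first.

    A tracker that propagates forward in time can then carry labels from
    the last recorded frame toward the first one.
    """
    return frames[::-1]


# Toy example: a 6x8 image with one rectangular object mask.
mask = np.zeros((6, 8), dtype=bool)
mask[2:5, 3:7] = True  # rows 2..4, columns 3..6

box = mask_to_box(mask)
frames = ["frame_0", "frame_1", "frame_2"]  # frame_2 is the annotated one
ordered = reverse_for_propagation(frames)

print(box)      # (3, 2, 6, 4)
print(ordered)  # ['frame_2', 'frame_1', 'frame_0']
```

Masks predicted on the reversed sequence would need to be mapped back to the original frame indices before being used as training labels.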