
EXOT: Exit-aware Object Tracker for Safe Robotic Manipulation of Moving Object

Hyunseo Kim, Hye Jung Yoon, Minji Kim, Dong-Sig Han, Byoung-Tak Zhang

TL;DR

EXOT addresses safe robotic manipulation from a hand-mounted camera by coupling a long-term transformer-based tracker with an out-of-distribution (OOD) classifier to detect the absence of the target and trigger conservative actions. The method builds on STARK with three heads—bounding box prediction, template update score, and OOD score—using Generalized ODIN principles to compute $p(y, d_{in}|x)$ and $p(d_{in}|x)$, and applies time-smoothed thresholds for exit decisions. A new RMOT-223 dataset is introduced, alongside comprehensive experiments on TREK-150, ablations, and a real UR5e sushi task, showing up to 38% improvement in exit-awareness over STARK. The work demonstrates practical safety benefits for first-person robotics in dynamic environments and offers a framework adaptable to other robotic domains, albeit with dataset-specific threshold considerations.
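The time-smoothed exit decision mentioned above can be illustrated with a minimal sketch. It assumes the tracker emits one in-distribution (ID) score per frame, and it uses a plain moving average with a fixed threshold. The function name, the window size, and the threshold value are hypothetical choices for illustration; this summary does not specify the paper's exact smoothing rule.

```python
from collections import deque
from typing import Iterable, List


def smoothed_exit_flags(
    id_scores: Iterable[float],
    threshold: float = 0.5,  # assumed value; the summary notes thresholds are dataset-specific
    window: int = 5,         # assumed smoothing window, in frames
) -> List[bool]:
    """Return one flag per frame; True means the target is judged absent (exit)."""
    if window < 1:
        raise ValueError("window must be >= 1")
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    flags = []
    for score in id_scores:
        recent.append(score)
        mean_score = sum(recent) / len(recent)
        # Declare an exit only when the smoothed ID score drops below the threshold.
        flags.append(mean_score < threshold)
    return flags


if __name__ == "__main__":
    # A single low-score frame (index 2) does not trigger an exit; a sustained drop does.
    scores = [0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.1]
    print(smoothed_exit_flags(scores, threshold=0.5, window=3))
```

Smoothing trades a few frames of delay for robustness: one noisy score does not halt the robot, while a run of low scores does.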

Abstract

Current robotic hand manipulation narrowly operates with objects in predictable positions in limited environments. Thus, when the location of the target object deviates severely from the expected location, a robot sometimes responds in an unexpected way, especially when it operates with a human. For safe robot operation, we propose the EXit-aware Object Tracker (EXOT) on a robot hand camera that recognizes an object's absence during manipulation. The robot decides whether to proceed by examining the tracker's bounding box output containing the target object. We adopt an out-of-distribution classifier for more accurate object recognition, since trackers can mistrack the background as the target object. To the best of our knowledge, our method is the first to apply an out-of-distribution classification technique to a tracker output. We evaluate our method on the first-person video benchmark dataset, TREK-150, and on a custom dataset, RMOT-223, that we collected with the UR5e robot. We then test our tracker on the UR5e robot in real time with a conveyor-belt sushi task, to examine the tracker's ability to track target dishes and to determine the exit status. Our tracker shows 38% higher exit-aware performance than a baseline method. The dataset and the code will be released at https://github.com/hskAlena/EXOT.

Paper Structure

This paper contains 14 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: An overview of the problem setting. A robot with a hand camera tracks the target object (highlighted in green) on a moving conveyor belt. Target objects are not evenly spaced on the conveyor belt, so empty spots come up frequently. When the target object is not visible to the camera, the robot recognizes an abnormal situation (exit) and does not attempt to place. When the target object is visible on the hand camera, the robot places the object it was holding. A camera frame with the target object is in-distribution (ID), and a frame without the target object is out-of-distribution (OOD).
  • Figure 2: An illustration of the proposed tracker, EXOT. At the start of training, a search image and an initial template are processed by the backbone network and concatenated, then fed sequentially into the Transformer encoder and decoder. Three head networks (the OOD score head, the bounding box (bbox) prediction head, and the score head) process the inputs shown in the diagram and output a classification confidence score, bbox coordinates, and a template update score, respectively. Using this information, a dynamic template is cropped from the search image and used as an input in the next step. The internal networks of the OOD score head are shown on the right. Further details are given in Eq. \ref{eq:logit_ood}.
  • Figure 3: Dataset examples. (a) shows how objects are positioned in the pick-and-place task. On average, 6 objects appear in one scene, and each video is recorded from the moment the robot starts moving until it picks up one object. (b) shows examples from the TREK-150 dataset.
  • Figure 4: An illustration of how our robot obtains an action signal from its hand camera image. The positive x-y-z axes of the robot hand are shown in (b). Given a hand camera image such as (c), we obtain the center pixel of the target object and the ratio between the pixel width of the camera view and the real-world width captured in that view. From this information, we derive the distance the robot should move so that the target object's center lies directly under its gripper. A real camera image and its grayscale version with the depth center point are shown in (a).
  • Figure 5: Qualitative results on exit prediction on the RMOT-223 dataset. (a) and (b) show exit predictions by STARK and EXOT, respectively. The green line shows the model's exit prediction and the dotted red line shows the ground-truth exit. The ground-truth binary label is 1 when the target is visible and 0 when it has exited the frame.
  • ...and 1 more figure
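The pixel-to-distance conversion described in the Figure 4 caption can be sketched roughly as follows. This is an illustrative reading of the caption, not the authors' implementation: it assumes the gripper sits directly above the image center, that pixels are square, and that the image axes are aligned with the robot hand's x-y axes. The function name, its parameters, and the sign convention are hypothetical.

```python
from typing import Tuple


def pixel_offset_to_delta(
    target_center_px: Tuple[float, float],
    image_size_px: Tuple[int, int],
    view_width_m: float,
) -> Tuple[float, float]:
    """Convert the target's pixel offset from the image center into metres.

    view_width_m is the real-world width covered by the camera view
    (an assumed input; in practice it would depend on the measured depth).
    """
    cx, cy = target_center_px
    width_px, height_px = image_size_px
    if width_px <= 0 or height_px <= 0:
        raise ValueError("image dimensions must be positive")
    # Ratio between real-world width and pixel width of the camera view.
    metres_per_pixel = view_width_m / width_px
    # Offset of the target center from the image center, scaled to metres.
    dx = (cx - width_px / 2) * metres_per_pixel
    dy = (cy - height_px / 2) * metres_per_pixel
    return dx, dy


if __name__ == "__main__":
    # Target 160 px right of center in a 640x480 image covering 0.32 m of width.
    print(pixel_offset_to_delta((480, 240), (640, 480), 0.32))
```

In a real setup the mapping from image axes to robot axes (including sign flips and any camera-to-gripper offset) would have to come from calibration.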