Show, Don't Tell: Detecting Novel Objects by Watching Human Videos

James Akl, Jose Nicolas Avendano Arbelaez, James Barabas, Jennifer L. Barry, Kalie Ching, Noam Eshed, Jiahui Fu, Michel Hidalgo, Andrew Hoelscher, Tushar Kusnur, Andrew Messing, Zachary Nagler, Brian Okorn, Mauro Passerino, Tim J. Perkins, Eric Rosen, Ankit Shah, Tanmay Shankar, Scott Shaw

Abstract

How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions and expensive prompt engineering by training a bespoke object detector on an automatically created dataset, supervised by the human demonstration itself. In our approach, "Show, Don't Tell," we show the detector the specific objects of interest during the demonstration, rather than telling the detector about these objects via complex language descriptions. By bypassing language altogether, this paradigm enables us to quickly train bespoke detectors tailored to the relevant objects observed in human task demonstrations. We develop an integrated on-robot system to deploy our "Show, Don't Tell" paradigm of automatic dataset creation and novel-object detection on a real-world robot. Empirical results demonstrate that our pipeline significantly outperforms state-of-the-art detection and recognition methods for manipulated objects, leading to improved task completion for the robot.

Paper Structure (71 sections, 9 figures, 4 tables, 2 algorithms)

Figures (9)

  • Figure 2: Our Salient Objects Dataset Creation (SODC) pipeline: 1. Detecting Grasped Entities: A Human-Object-Interaction detector is used to detect and segment grasped objects in each frame (shown in pink). 2. Tracking Grasped Masks: Entities are tracked over time. Note that each grasp segmentation is used as a seed to a tracking algorithm, resulting in multiple tracks per grasped object, with each track represented as a colored trajectory. 3. Consolidating: Tracks are clustered across space and time to identify individual objects.
  • Figure 3: Track Clustering: Combining multiple bounding box tracks ("Indv. Tracks" with colors corresponding to their bounding boxes) into per-object tracks (highlighted light purple and cyan). First, the bounding boxes in each frame are spatially clustered (F1C1, F1C2, etc.). Then the tracks are temporally grouped if they traverse the same or highly similar sequences of spatial clusters. Note that although five bounding boxes are clustered spatially at t=30, they are grouped into two final temporal groups because those bounding boxes are not clustered together in all frames. The short magenta track is discarded as noise because its temporal group does not include enough tracks.
  • Figure 4: The on-robot application flow: (1) The participants make novel objects and demonstrate a sort to the camera. From the video the robot generates (2) a plan skeleton, (3) a salient objects dataset, and a manipulated objects detector. (4) The robot executes the sort. Note that the basket the robot uses is not identical to the basket used in the human demonstration since we use a VLM to recognize objects the human does not manipulate directly.
  • Figure 5: Examples of objects constructed by human participants. Red text shows human-generated prompts from which VLMs could not detect the large object, while green text shows successful prompts, generated iteratively in a single labeling session. The blue text is the prompt generated by ChatGPT-4o.
  • Figure 6: A timelapse of the on-robot application showing the human demonstration and the robot's execution. The full demonstration and execution comprised four pick-and-place actions.
  • ...and 4 more figures
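
The track clustering described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the IoU-based spatial clustering, the union-find grouping, and the names `cluster_tracks`, `iou_thresh`, `min_agreement`, and `min_tracks` are all assumptions made for illustration. The sketch clusters boxes spatially in each frame, groups tracks that share a spatial cluster in every frame where both are present, and discards groups containing too few tracks.

```python
from itertools import combinations


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def connected_groups(items, linked):
    """Union-find: merge items transitively whenever linked(a, b) is true."""
    parent = {i: i for i in items}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in combinations(items, 2):
        if linked(a, b):
            parent[find(a)] = find(b)
    groups = {}
    for i in items:
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())


def cluster_tracks(tracks, iou_thresh=0.5, min_agreement=1.0, min_tracks=2):
    """tracks: {track_id: {frame: (x1, y1, x2, y2)}} -> list of sets of track ids.

    All thresholds are hypothetical defaults, not values from the paper.
    """
    # Step 1: spatially cluster the boxes present in each frame.
    frames = sorted({f for boxes in tracks.values() for f in boxes})
    label = {}  # (frame, track_id) -> spatial cluster index within that frame
    for f in frames:
        present = [t for t in tracks if f in tracks[t]]
        clusters = connected_groups(
            present, lambda a, b: iou(tracks[a][f], tracks[b][f]) >= iou_thresh
        )
        for idx, members in enumerate(clusters):
            for t in members:
                label[(f, t)] = idx

    # Step 2: group tracks that traverse the same spatial clusters over time.
    def similar(a, b):
        shared = [f for f in tracks[a] if f in tracks[b]]
        if not shared:
            return False
        agree = sum(label[(f, a)] == label[(f, b)] for f in shared)
        return agree / len(shared) >= min_agreement

    groups = connected_groups(list(tracks), similar)

    # Step 3: drop temporal groups with too few tracks, treating them as noise.
    return [g for g in groups if len(g) >= min_tracks]
```

With `min_agreement=1.0`, two tracks that overlap in only some frames (for example, when two objects are briefly brought together) remain in separate groups, which mirrors the behavior the caption describes at t=30. Lowering `min_agreement` would loosely approximate the "highly similar sequences" case.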