ImageInThat: Manipulating Images to Convey User Instructions to Robots

Karthik Mahadevan, Blaine Lewis, Jiannan Li, Bilge Mutlu, Anthony Tang, Tovi Grossman

TL;DR

The paper addresses the persistent challenge of instructing robots in real-world tasks where natural language can be ambiguous and traditional end-user programming can be hard to ground. It proposes ImageInThat, a direct image manipulation interface that lets users edit environment images along a timeline to generate robot instructions, leveraging multiple foundation models to caption edits, predict goals, and translate manipulations into executable steps. In a user study with ten participants across four kitchen tasks, ImageInThat produced substantially faster instruction generation (about 64-65% less time) and higher confidence and usability than a text-based baseline, with a real-robot case study demonstrating feasibility. The work demonstrates the viability of multimodal instruction for robotics and points to future directions in blending image and language interfaces, improving perception in-the-wild, and enabling richer back-and-forth human-robot collaboration.

Abstract

Foundation models are rapidly improving the capability of robots in performing everyday tasks autonomously such as meal preparation, yet robots will still need to be instructed by humans due to model performance, the difficulty of capturing user preferences, and the need for user agency. Robots can be instructed using various methods: natural language conveys immediate instructions but can be abstract or ambiguous, whereas end-user programming supports longer-horizon tasks but interfaces face difficulties in capturing user intent. In this work, we propose using direct manipulation of images as an alternative paradigm to instruct robots, and introduce a specific instantiation called ImageInThat which allows users to perform direct manipulation on images in a timeline-style interface to generate robot instructions. Through a user study, we demonstrate the efficacy of ImageInThat to instruct robots in kitchen manipulation tasks, comparing it to a text-based natural language instruction method. The results show that participants were faster with ImageInThat and preferred to use it over the text-based method. Supplementary material including code can be found at: https://image-in-that.github.io/.

Paper Structure

This paper contains 21 sections, 10 figures.

Figures (10)

  • Figure 1: We introduce the direct manipulation of images as a paradigm for providing instructions to a robot. Depicted in the bottom left are a series of instructions that a user is giving the robot by manipulating the fruits. The top shows one trajectory of direct manipulation
  • Figure 2: ImageInThat's user interface, consisting of an editor (top) and a timeline (bottom). The editor allows users to manipulate objects and fixtures in the environment, while the timeline displays the current state of the environment and the desired changes. The timeline (B) shows all instructions to the robot. Selecting a step populates it in the editor (B). Changes between steps are made visible by contrasting changed objects and fixtures from other items (C). ImageInThat automatically captions all manipulations and allows them to be edited. The user can instruct the robot with text to generate new steps automatically. ImageInThat tries to predict user goals such as by proposing locations where objects can be placed (F) or predicting a future step (G).
  • Figure 3: System diagram of ImageInThat showing its major components. The server side handles the preprocessing step and all intelligent features that require interfacing with the LLM (e.g., autocomplete, captioning, and language-to-step generation). The client is a web user interface built with ReactJS.
  • Figure 4: Tasks performed by participants in the evaluation (left to right): organizing pantry, sorting fruits, cooking stir-fry, and washing dishes. Depicted here is the environment state at the beginning of each task.
  • Figure 5: Left: the number of errors for both ImageInThat and the Text system, broken into three categories (extraneous steps, missing steps, and inefficient steps), with an additional bar showing the total count of all errors. Middle: the task completion time for each of the 4 tasks and across all tasks. Right: the counts of responses for the NASA-TLX questionnaire. All error bars are bootstrapped 95% confidence intervals.
  • ...and 5 more figures