Sketch-MoMa: Teleoperation for Mobile Manipulator via Interpretation of Hand-Drawn Sketches

Kosei Tanada, Yuka Iwanaga, Masayoshi Tsuchinaga, Yuji Nakamura, Takemitsu Mori, Remi Sakai, Takashi Yamamoto

TL;DR

Sketch-MoMa tackles the need for accessible teleoperation of mobile manipulators using common 2D devices by grounding hand-drawn sketches on observation images with Vision-Language Models to infer downstream tasks and sketch shapes. It integrates VLM-based grounding with perception-driven object detection, grasp pose selection, and trajectory planning to enable both manipulation and navigation, with a user-in-the-loop feedback mechanism. The approach is validated across 7 tasks and 5 sketch shapes, showing competitive usability relative to a 2D baseline and high real-world task reliability for several actions, while revealing challenges in multi-shape sketches and real-time responsiveness. Overall, the work demonstrates that simple sketches, when interpreted by foundation models, can provide intuitive and actionable guidance for mobile manipulation, with clear avenues for robustness and speed improvements in real-world deployment.

Abstract

For assistive robots to be useful in everyday life, a remote control system that runs on common 2D devices helps users control the robot as intended, anytime and anywhere. Hand-drawn sketches are an intuitive way to control robots from such devices. However, because similar sketches can carry different intentions from scene to scene, existing work relies on additional modalities to specify a sketch's semantics. This requires complex operations from users and reduces usability. In this paper, we propose Sketch-MoMa, a teleoperation system that uses hand-drawn sketches as instructions to control a robot. We use Vision-Language Models (VLMs) to interpret the user's sketches superimposed on an observation image and to infer the drawn shapes and the robot's low-level tasks. We then use the sketches and the inferred shapes for recognition and motion planning of the inferred low-level tasks, enabling precise and intuitive operation. We validate our approach using state-of-the-art VLMs with 7 tasks and 5 sketch shapes. We also demonstrate that our approach effectively specifies detailed motions, such as how to grasp and how much to rotate. Moreover, a user experiment with 14 participants shows that our approach offers usability competitive with an existing 2D interface.

Paper Structure

This paper contains 23 sections, 1 equation, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: We propose Sketch-MoMa: easy and intuitive teleoperation for mobile manipulation with hand-drawn sketches by combining task planning with a VLM and motion planning.
  • Figure 2: Overview of Sketch-MoMa. Users interact with a mobile manipulator through a 2D interface, using widgets and drawing sketches on the canvas (top left). We bridge sketch instructions to robot control by implementing task planning with a VLM and motion planning with user-given sketches.
  • Figure 3: Key text descriptions to ground a VLM to understand sketches.
  • Figure 4: In motion planning, we detect objects and their poses based on tasks, given sketches, and their shapes. We then plan the end-effector trajectory to the target objects. The robot asks for feedback from the user after reaching the target objects.
  • Figure 5: Variation of sketches for the instruction of detailed motions. We provide four grasping directions with a U-shape and two rotations with an arrow shape. For VoxPoser (Huang et al., 2023), we place numbers on objects with SoM (Yang et al., 2023) so that the VLM can visually identify the specified objects.
  • ...and 1 more figure
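
The arrow-shaped rotation sketches mentioned in Figure 5 raise the question of how a drawn stroke could be turned into a rotation amount. The source does not describe how Sketch-MoMa does this, so the following is only a minimal sketch of one possible approach, not the authors' method: it assumes the stroke is available as an ordered list of 2D points and estimates the rotation as the total signed turning angle along the stroke. The function name `swept_angle_deg` and the point-list input format are hypothetical.

```python
import math


def swept_angle_deg(points):
    """Estimate how far a curved stroke turns, in degrees.

    Hypothetical helper, not taken from the paper. `points` is an ordered
    list of (x, y) samples along the stroke. The result is the sum of the
    signed turning angles between successive segments. With a y-up axis,
    counter-clockwise turns are positive; with image coordinates (y-down),
    the sign is flipped, so clockwise on screen is positive.
    """
    total = 0.0
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        ax, ay = x1 - x0, y1 - y0  # incoming segment
        bx, by = x2 - x1, y2 - y1  # outgoing segment
        cross = ax * by - ay * bx  # proportional to sin(turn)
        dot = ax * bx + ay * by    # proportional to cos(turn)
        total += math.atan2(cross, dot)
    return math.degrees(total)


# Example stroke: a quarter circle sampled every degree (assumed input).
quarter_arc = [
    (math.cos(math.radians(d)), math.sin(math.radians(d))) for d in range(91)
]
estimated_rotation = swept_angle_deg(quarter_arc)
```

For an arc sampled into n equal segments, this estimate recovers (n - 1)/n of the true central angle, so denser sampling gives a closer value. A real system would also need to smooth noisy strokes and separate the arrowhead from the shaft, which this sketch does not attempt.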