Sketch-MoMa: Teleoperation for Mobile Manipulator via Interpretation of Hand-Drawn Sketches
Kosei Tanada, Yuka Iwanaga, Masayoshi Tsuchinaga, Yuji Nakamura, Takemitsu Mori, Remi Sakai, Takashi Yamamoto
TL;DR
Sketch-MoMa tackles the need for accessible teleoperation of mobile manipulators using common 2D devices by grounding hand-drawn sketches on observation images with Vision-Language Models to infer downstream tasks and sketch shapes. It integrates VLM-based grounding with perception-driven object detection, grasp pose selection, and trajectory planning to enable both manipulation and navigation, with a user-in-the-loop feedback mechanism. The approach is validated across 7 tasks and 5 sketch shapes, showing competitive usability relative to a 2D baseline and high real-world task reliability for several actions, while revealing challenges in multi-shape sketches and real-time responsiveness. Overall, the work demonstrates that simple sketches, when interpreted by foundation models, can provide intuitive and actionable guidance for mobile manipulation, with clear avenues for robustness and speed improvements in real-world deployment.
Abstract
To use assistive robots in everyday life, a remote control system built on common devices, such as 2D devices, helps users control the robots anytime and anywhere as intended. Hand-drawn sketches are an intuitive way to control robots with 2D devices. However, because similar sketches can carry different intentions from scene to scene, existing work needs additional modalities to specify the sketches' semantics. This requires complex operations from users and reduces usability. In this paper, we propose Sketch-MoMa, a teleoperation system that uses user-given hand-drawn sketches as instructions to control a robot. We use Vision-Language Models (VLMs) to interpret the sketches superimposed on an observation image and to infer the drawn shapes and the robot's low-level tasks. We then use the sketches and the inferred shapes for recognition and motion planning of the inferred low-level tasks, enabling precise and intuitive operation. We validate our approach using state-of-the-art VLMs on 7 tasks and 5 sketch shapes. We also demonstrate that our approach effectively specifies detailed motions, such as how to grasp and how much to rotate. Moreover, a user experiment with 14 participants shows that the usability of our approach is competitive with the existing 2D interface.
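The flow described in the abstract (superimpose the sketch on the observation image, ask a VLM for the drawn shape and the low-level task, then pass both to planning) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the labels "arrow" and "navigate", the names `overlay_sketch`, `interpret`, `plan`, and `stub_vlm`, and the list-of-rows image format are all hypothetical, and the VLM is replaced by a stub that always returns the same answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, int]   # (x, y) pixel coordinate
Image = List[List[int]]   # toy grayscale image stored as rows of ints


@dataclass
class Interpretation:
    shape: str  # hypothetical shape label, e.g. "arrow"
    task: str   # hypothetical task label, e.g. "navigate"


def overlay_sketch(image: Image, stroke: List[Point], mark: int = 255) -> Image:
    """Return a copy of the image with in-bounds stroke pixels set to `mark`."""
    out = [row[:] for row in image]
    for x, y in stroke:
        if 0 <= y < len(out) and 0 <= x < len(out[y]):
            out[y][x] = mark
    return out


def interpret(overlaid: Image, vlm: Callable[[Image], Dict[str, str]]) -> Interpretation:
    """Ask a VLM-like callable for the drawn shape and the low-level task."""
    answer = vlm(overlaid)
    return Interpretation(shape=answer["shape"], task=answer["task"])


def plan(interp: Interpretation, stroke: List[Point]) -> Dict[str, object]:
    """Turn the interpretation plus the raw stroke into a toy motion request."""
    if interp.task == "navigate":
        # Assumption: the last stroke point marks the goal. A real system
        # would still have to map this pixel into the robot's frame.
        return {"action": "navigate", "goal": stroke[-1]}
    # Fallback: hand the stroke to a downstream planner unchanged.
    return {"action": interp.task, "path": list(stroke)}


def stub_vlm(_image: Image) -> Dict[str, str]:
    """Stand-in for a real VLM call; always answers with the same labels."""
    return {"shape": "arrow", "task": "navigate"}


image = [[0] * 4 for _ in range(3)]   # 3 rows, 4 columns, all black
stroke = [(0, 0), (1, 1), (2, 2)]     # a short diagonal stroke
overlaid = overlay_sketch(image, stroke)
request = plan(interpret(overlaid, stub_vlm), stroke)
print(request)
```

Keeping the raw stroke alongside the inferred labels mirrors the abstract's point that both the sketch and the inferred shape feed recognition and motion planning. In practice `stub_vlm` would be replaced by a call to an actual model.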
