AnyUser: Translating Sketched User Intent into Domestic Robots

Songyuan Yang, Huibin Tan, Kailun Yang, Wenjing Yang, Shaowu Yang

Abstract

We introduce AnyUser, a unified system for intuitive domestic task instruction via free-form sketches drawn on camera images, optionally accompanied by language. AnyUser interprets multimodal inputs (sketch, vision, language) as spatial-semantic primitives and generates executable robot actions without requiring prior maps or models. Novel components include a multimodal fusion module for understanding and a hierarchical policy for robust action generation. Efficacy is shown via extensive evaluations: (1) Quantitative benchmarks on the large-scale HouseholdSketch dataset, showing high accuracy in interpreting diverse sketch-based commands across various simulated domestic scenes. (2) Real-world validation on two distinct robotic platforms, a statically mounted 7-DoF assistive arm (KUKA LBR iiwa) and a dual-arm mobile manipulator (Realman RMC-AIDAL), performing representative tasks such as targeted wiping and area cleaning, confirming the system's ability to ground instructions and execute them reliably in physical environments. (3) A comprehensive user study involving diverse demographics (elderly, simulated non-verbal, and low-technical-literacy participants), demonstrating significant improvements in usability and task-specification efficiency, with high task completion rates (85.7%-96.4%) and user satisfaction. AnyUser bridges the gap between advanced robotic capabilities and the need for accessible non-expert interaction, laying the foundation for practical assistive robots adaptable to real-world human environments.

Paper Structure

This paper contains 52 sections, 25 equations, 9 figures, and 9 tables.

Figures (9)

  • Figure 1: AnyUser architecture and runtime workflow. The user provides a third-person photograph $I$, draws sketches $S$ on the image, and may add a short language cue $L$. The sketch is deterministically segmented into an ordered sequence $S_{seq}$ and, together with $I$ and $L$, is encoded by the multimodal model $f_{fuse}$ to yield a runtime representation that conditions the hierarchical policy $\pi_{HL}$. For each segment the policy produces a high-level command $a'_k$ from the discrete set $A_{disc}$. During execution, egocentric live perception $P_t$ can be injected as an additional image-channel input for reactive checks such as obstacle presence and under-obstacle clearance. The command translation module $g_{translate}$ converts each high-level command into platform-specific multi-DoF control $a_k \in A_{DoF}$, which is executed by the robot’s low-level controllers in a closed loop (a minimal code sketch of this loop follows the figure list).
  • Figure 2: Overview of the HouseholdSketch dataset utilized for training and evaluation. Left: A selection of representative images illustrating the visual diversity of indoor environments included. Center: Pie chart depicting the proportional distribution of scene categories within the dataset. Right: Exemplar sketch inputs overlaid on corresponding scene images and shown in isolation.
  • Figure 3: Detailed pipeline of the AnyUser architecture. The input layer receives visual ($I$), sketch ($S$), and linguistic ($L$) inputs. After preprocessing, modality-specific encoders extract features. The fusion module $\psi_{\text{fuse}}$ aligns the sketch with the image via cross-modal attention and an MLP, and the hierarchical policy $\pi_{\text{HL}}$ predicts a macro-action per segment. Command translation $g_{\text{translate}}$ converts these macros into platform-specific multi-DoF primitives for execution.
  • Figure 4: Representative robotic platforms relevant to this work. (a) The KUKA LBR iiwa, a 7-DoF collaborative manipulator. (b) The Realman RMC-AIDAL, a dual-arm mobile manipulation platform.
  • Figure 5: Scene-specific task-level performance comparison. Task length categories are defined by sketch complexity (Short: $\leq$ 2 corners, Medium: 3-5 corners, Long: $\geq$ 6 corners). The figure presents (Left) Full Task Completion Rate (FTCR) and (Right) Full Task Strict Path Adherence Rate (FTSPAR) in percentage (%). Results are shown for each scene category from the HouseholdSketch dataset, grouped by task length. Error bars depict simulated standard error, indicating expected performance variability.
  • ...and 4 more figures
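
To make the runtime workflow described in the Figure 1 caption concrete, the following minimal Python sketch traces the same data flow: segment the sketch into an ordered sequence, fuse modalities, pick one macro-action per segment, translate it into a platform-specific primitive, and execute in a closed loop with a reactive perception check. All names here (segment_sketch, fuse, policy, translate, run, A_DISC) and the macro labels are hypothetical stand-ins for exposition, not a published AnyUser API.

```python
# Hypothetical illustration of the Figure 1 loop; not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Assumed discrete macro-action set A_disc (placeholder labels).
A_DISC = ["approach", "wipe_segment", "lift_over_obstacle", "retract"]

@dataclass
class Segment:
    points: List[Tuple[int, int]]  # ordered (u, v) pixel coordinates along one stroke

def segment_sketch(strokes: Sequence[Sequence[Tuple[int, int]]]) -> List[Segment]:
    """Deterministically split free-form strokes into the ordered sequence S_seq."""
    return [Segment(points=list(stroke)) for stroke in strokes]

def fuse(image, segments: List[Segment], language: str) -> List[dict]:
    """Stand-in for the multimodal fusion model: one runtime feature per segment."""
    return [{"segment": s, "lang": language, "image": image} for s in segments]

def policy(feature: dict, live_perception: Optional[dict] = None) -> str:
    """Stand-in for the hierarchical policy: choose one macro-action from A_DISC,
    optionally reacting to egocentric perception (e.g. an obstacle check)."""
    if live_perception and live_perception.get("obstacle_ahead"):
        return "lift_over_obstacle"
    return "wipe_segment"

def translate(macro: str, segment: Segment) -> dict:
    """Stand-in for command translation into a platform-specific multi-DoF primitive."""
    return {"primitive": macro, "waypoints": segment.points}

def run(image, strokes, language: str,
        execute: Callable[[dict], None], perceive: Callable[[], dict]) -> None:
    """Closed loop over sketch segments: fuse -> pick macro -> translate -> execute."""
    for feature in fuse(image, segment_sketch(strokes), language):
        macro = policy(feature, live_perception=perceive())
        execute(translate(macro, feature["segment"]))
```

In a real deployment, the `execute` and `perceive` callables would be bound to the target platform's controller and egocentric camera streams, so the same high-level loop could drive either a fixed arm or a mobile manipulator.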