See and Switch: Vision-Based Branching for Interactive Robot-Skill Programming

Petr Vanc; Jan Kristof Behrens; Václav Hlaváč; Karla Stepanova

TL;DR

This paper presents See&Switch, an interactive teaching-and-execution framework that represents tasks as user-extendable graphs of skill parts connected via decision states (DS), enabling conditional branching during replay, and integrates kinesthetic teaching, joystick control, and hand gestures via an input-modality-abstraction layer.

Abstract

Programming robots by demonstration (PbD) is an intuitive concept, but scaling it to real-world variability remains a challenge for most current teaching frameworks. Conditional task graphs are very expressive and can be defined incrementally, which fits the PbD idea well. However, acting on conditional task graphs requires reliable, perception-grounded online branch selection. In this paper, we present See & Switch, an interactive teaching-and-execution framework that represents tasks as user-extendable graphs of skill parts connected via decision states (DS), enabling conditional branching during replay. Unlike prior approaches that rely on manual branching or low-dimensional signals (e.g., proprioception), our vision-based Switcher uses high-dimensional eye-in-hand images to select among competing successor skill parts and to detect out-of-distribution contexts that require new demonstrations. We integrate kinesthetic teaching, joystick control, and hand gestures via an input-modality-abstraction layer and demonstrate that our proposed method is independent of the teaching modality, enabling efficient in-situ recovery demonstrations. The system is validated in experiments on three challenging dexterous manipulation tasks. We evaluate our method under diverse conditions and conduct user studies with 8 participants. We show that the proposed method reliably performs branch selection and anomaly detection for novice users, achieving 90.7 % and 87.9 % accuracy, respectively, across 576 real-robot rollouts. We provide all code and data required to reproduce our experiments at http://imitrob.ciirc.cvut.cz/publications/seeandswitch.
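
To make the idea of a user-extendable graph of skill parts connected via decision states more concrete, the following is a minimal Python sketch. It is not the authors' implementation: the class names `SkillPart` and `TaskGraph`, the fields `offset` and `end`, and the methods `add_branch` and `successors` are all hypothetical, chosen only to illustrate how a new demonstration could be attached at a decision state.

```python
from dataclasses import dataclass, field


@dataclass
class SkillPart:
    """Hypothetical container for one demonstrated trajectory segment."""
    name: str
    offset: int  # time step at which the part starts
    end: int     # time step at which the part terminates


@dataclass
class TaskGraph:
    """Hypothetical task graph: skill parts linked through decision states (DS)."""
    parts: dict = field(default_factory=dict)
    # (parent part name, DS time step) -> names of candidate successor parts
    decision_states: dict = field(default_factory=dict)

    def add_part(self, part: SkillPart) -> None:
        self.parts[part.name] = part

    def add_branch(self, parent: str, t: int, part: SkillPart) -> None:
        """Attach a newly demonstrated part as a successor at DS (parent, t)."""
        if parent not in self.parts:
            raise KeyError(f"unknown parent part: {parent}")
        self.add_part(part)
        # Assumption: continuing the parent is always a candidate at its own DS.
        options = self.decision_states.setdefault((parent, t), [parent])
        options.append(part.name)

    def successors(self, parent: str, t: int) -> list:
        """Return the candidate parts at DS (parent, t); empty if no DS exists."""
        return list(self.decision_states.get((parent, t), []))


# Toy example: two recovery demonstrations branch off s0 at time step 10.
graph = TaskGraph()
graph.add_part(SkillPart("s0", offset=0, end=20))
graph.add_branch("s0", t=10, part=SkillPart("s1", offset=10, end=25))
graph.add_branch("s0", t=10, part=SkillPart("s2", offset=10, end=18))
```

In this sketch the graph grows only through `add_branch`, mirroring the incremental definition the abstract describes; how the actual system stores trajectories and decision-state windows is not specified in this excerpt.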

Paper Structure (39 sections, 3 equations, 9 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Method
Robot policy
Task execution
Human–Robot Interaction Workflow
Anomaly event
Branch (recovery behavior)
Refine
Input–Modality Layer (Modality–Agnostic API)
Observation-based State Evaluator (The Switcher)
Anomaly detector
Context window
Permitted skill parts available for switching
State Estimator
...and 24 more sections

Figures (9)

Figure 1: Interactive robot teaching framework. The user requests to wrap the cable. The user teaches the robot a task using (A) kinesthetic teaching, (B) a joystick, or (C) hand gestures (blue background). During execution, a robot trajectory is replayed, and the ★ marks a decision state (DS). At this point, the system may select the most suitable successor skill part, $s_0$ (option 2) or $s_1$ (option 3), or trigger an anomaly (option 1) if no previously seen option fits the observation. The Switcher is described in Sec. \ref{sec:switcher}.
Figure 2: Task-graph example. Four skill parts ($s_{0,1,2,3}$) form four distinct skill variants. Each skill part has an offset $K_{(i)}$ and terminates at a different time step $t$. Decision state (DS) windows are located around $t=10$ and $t=15$. The task graph grows online as branching and refinement add skill parts.
Figure 3: Interactive robot teaching & execution framework. Unchanged CIP core: DS logic where the user verifies anomaly $a$, and the insertion rule (branch). New components: (1) a modality-agnostic input layer (gestures/joystick/kinesthetic) that maps human intent to robot controls, (2) an optional eye-in-hand vision channel. $Z^U$ is a subset of images used for training the Switcher, defined in Sec. \ref{sec:switcher}. When the skill part $i^t$ differs from the previous one ($i^t \neq i^{t-1}$), we load and extract a new skill part trajectory from the library $\text{Parts}^T$.
Figure 4: Teaching a "peg pick" task with four separate runs. ★ marks a decision state. The eye-in-hand images at the DS (timestep $t=49$) show the peg visible or absent. (bottom) Likelihoods for two test runs around the DS.
Figure 5: Starting states of the environment for three considered tasks.
...and 4 more figures
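
The decision-state behaviour described in the Figure 1 caption (select a successor skill part, or trigger an anomaly when no previously seen option fits) can be illustrated with a small Python sketch. This is an assumption-laden illustration, not the paper's Switcher: the function `select_successor`, the `ANOMALY` sentinel, and the fixed `threshold` are hypothetical, and the per-successor likelihoods are taken as given inputs rather than computed from eye-in-hand images.

```python
ANOMALY = "anomaly"  # hypothetical sentinel for "no known option fits"


def select_successor(likelihoods: dict, threshold: float = 0.5) -> str:
    """Pick the best-scoring permitted successor, or report an anomaly.

    likelihoods: maps each permitted successor name to a score in [0, 1].
    threshold:   assumed cut-off below which no option is trusted.
    """
    if not likelihoods:
        return ANOMALY
    best = max(likelihoods, key=likelihoods.get)
    return best if likelihoods[best] >= threshold else ANOMALY


# One observation clearly matches s0; the other matches neither option.
confident = select_successor({"s0": 0.9, "s1": 0.1})
uncertain = select_successor({"s0": 0.2, "s1": 0.3})
```

Here `confident` resolves to `"s0"` and `uncertain` to the anomaly sentinel, which in the workflow described above would prompt the user for a new demonstration. A single fixed threshold is the simplest possible rule and is used only for illustration.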
