Imitation Learning-based Direct Visual Servoing using the Large Projection Formulation

Sayantan Auddy; Antonio Paolillo; Justus Piater; Matteo Saveriano

TL;DR

This work tackles robust direct visual servoing in unstructured environments by combining off-the-shelf deep learning (DL) perception with trajectories learned by imitation, within a large projection control framework. The approach, called ILDVS, uses a frozen DL detector (YOLO) to extract visual features and a Neural Ordinary Differential Equation (NODE) to generate corrective velocities learned from demonstrations, ensuring convergence to a target while enabling complex motions. Real-robot experiments on a Franka Panda with mouse and cup tasks show that ILDVS outperforms purely DL-based and purely imitation-based baselines: it handles novel object positions and clutter and achieves a high success rate when dropping objects into a cup. The method offers a modular integration of perception and control with stability guarantees, and is available as open source.

Abstract

Today, robots must be safe, versatile, and user-friendly to operate in unstructured and human-populated environments. Dynamical system-based imitation learning enables robots to perform complex tasks stably and without explicit programming, greatly simplifying their real-world deployment. To exploit the full potential of these systems, it is crucial to implement closed loops that use visual feedback. Vision makes it possible to cope with environmental changes, but it is complex to handle because of the high dimension of the image space. This study introduces a dynamical system-based imitation learning approach for direct visual servoing. It leverages off-the-shelf deep learning-based perception modules to extract robust features from the raw input image, and an imitation learning strategy to execute sophisticated robot motions. The learning blocks are integrated using the large projection task priority formulation. As demonstrated through extensive experimental analysis, the proposed method realizes complex tasks with a robotic manipulator.
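
As a rough illustration of how a large projection step can merge a feature-error task with a learned velocity, the NumPy sketch below uses the standard norm-based form of the large projection operator. It is not taken from the paper's code: the function name `large_projection_control`, the arguments `J`, `e`, `v_secondary`, and `gain`, and the 8-feature, 6-velocity dimensions are all assumptions made for the example. Here `e` stands in for the visual feature error (for instance, stacked bounding-box corner coordinates), `J` for the matrix mapping the commanded velocity to the feature error rate, and `v_secondary` for the velocity proposed by the imitation policy.

```python
import numpy as np


def large_projection_control(J, e, v_secondary, gain=1.0, eps=1e-9):
    """Hypothetical sketch: regulate only the norm of e, project the rest.

    J           : (m, n) matrix with e_dot = J @ v (assumed model)
    e           : (m,) feature error
    v_secondary : (n,) velocity from a secondary source (e.g. a learned policy)
    """
    g = J.T @ e                # direction in command space that changes ||e||
    denom = float(g @ g)       # equals e^T J J^T e
    if denom < eps:
        # Simplification for this sketch: the projector is undefined when
        # J^T e vanishes, so only the secondary velocity is returned.
        return np.array(v_secondary, dtype=float)
    # Primary term: makes 0.5 * ||e||^2 decrease at rate gain * ||e||^2.
    primary = -gain * float(e @ e) * g / denom
    # Rank-one projector: removes only the component of v_secondary along g,
    # leaving n - 1 directions free for the secondary velocity.
    P = np.eye(J.shape[1]) - np.outer(g, g) / denom
    return primary + P @ v_secondary


# Usage with made-up dimensions: 8 features, 6 velocity components.
rng = np.random.default_rng(0)
J = rng.standard_normal((8, 6))
e = rng.standard_normal(8)
v_learned = rng.standard_normal(6)
v_cmd = large_projection_control(J, e, v_learned, gain=2.0)
```

Because the projector has rank n - 1, the secondary velocity keeps almost all of its freedom while the error norm still decreases. A practical controller would also need to handle the neighbourhood of zero error, where this projector becomes ill-defined; the sketch sidesteps that with a simple threshold.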

Paper Structure (18 sections, 12 equations, 14 figures, 1 table)

Introduction
Related work
Background
Approach
DL-based detection and its limits
Overcoming the detection limits through imitation
Merging DL and IL with the large projector
Experimental setup
Hardware and software components
YOLO detector
Collection of demonstrations
NODE training
Evaluation protocol
Experimental results
Centering the mouse in the image
...and 3 more sections

Figures (14)

Figure 1: Our work combines off-the-shelf deep learning strategies to detect objects in clutter with imitation learning to realize complicated trajectories, e.g., dropping a cube into a cup on an untidy table. The large projection formulation combines the two machine learning components and ensures convergence to a given target.
Figure 2: State-of-the-art DL-based systems like YOLO can be used to detect the features of an object of interest on a raw monocular image robustly. Examples of features are the vertices (denoted with 'C1', 'C2', 'C3', and 'C4') of the bounding box detected around the image of a cup. However, such detection systems fail to capture the correct object orientation. In the three snapshots, YOLO provides very similar feature values that correspond to three very different relative camera-object orientations producing a side (a), oblique (b), and top view (c) of the cup.
Figure 3: The proposed framework for ILDVS exploits a detection model (a frozen DL network, implemented by YOLO) to extract features from raw images robustly and IL (implemented as a fine-tuned NODE network) to realize complex trajectories and overcome the limitation of the detection model. The large projection formulation merges the output of the detection and imitation strategy in a closed-loop control law resulting in accurate and converging robot movements.
Figure 4: Initial (left) and final images (right) captured by the robot camera in the experiments with the mouse (top) and the cup (bottom). Desired visual features are shown in blue and denoted with the letter "G", whereas the current visual features are the green letters "C".
Figure 5: Position trajectories of demonstrations provided for the "Centering the mouse in the image" (left) and "Dropping an object in the cup" (right) tasks. Diversity is introduced by starting from different initial poses and also through the differences between each kinesthetic demonstration.
...and 9 more figures
