Imitation Learning-based Direct Visual Servoing using the Large Projection Formulation

Sayantan Auddy, Antonio Paolillo, Justus Piater, Matteo Saveriano

TL;DR

This work tackles robust direct visual servoing in unstructured environments by combining off-the-shelf deep learning (DL) perception with imitation-learned trajectories inside the large projection control framework. The approach, called ILDVS, uses a frozen DL detector (YOLO) to extract visual features and a Neural Ordinary Differential Equation (NODE), trained on demonstrations, to generate corrective velocities that enable complex motions; the large projection formulation merges the two and ensures convergence to the target. Real-robot experiments on a Franka Panda with mouse and cup tasks show that ILDVS outperforms purely DL-based and purely imitation-based baselines: it handles novel object positions and clutter and achieves a high success rate when dropping objects into a cup. The method offers a modular integration of perception and control with stability guarantees and is available as open source.

Abstract

Today, robots must be safe, versatile, and user-friendly to operate in unstructured and human-populated environments. Dynamical system-based imitation learning enables robots to perform complex tasks stably and without explicit programming, greatly simplifying their real-world deployment. To exploit the full potential of these systems, it is crucial to close the control loop with visual feedback. Vision makes it possible to cope with environmental changes, but it is difficult to handle because of the high dimensionality of the image space. This study introduces a dynamical system-based imitation learning approach for direct visual servoing. It leverages off-the-shelf deep learning-based perception modules to extract robust features from the raw input image, and an imitation learning strategy to execute sophisticated robot motions. The learning blocks are integrated using the large projection task priority formulation. As demonstrated through extensive experimental analysis, the proposed method realizes complex tasks with a robotic manipulator.
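The role of the large projection formulation can be illustrated with a minimal NumPy sketch. It assumes the generic large projection operator known from the redundancy literature, in which the primary task is the norm of the feature error and the secondary velocity is projected so that it cannot change that norm. The function name `large_projection_control`, the gain `lam`, the matrix `J` (standing in for the feature Jacobian) and the vector `g` (standing in for the velocity proposed by the imitation-learning block) are illustrative assumptions, not the authors' implementation; the switching strategy normally needed close to the target is left out.

```python
import numpy as np

def large_projection_control(e, J, g, lam=1.0, eps=1e-9):
    """Velocity that makes ||e|| decay at rate lam while following g otherwise.

    e : feature error (m,), J : feature Jacobian (m, n), g : secondary velocity (n,).
    All names are hypothetical; this is a sketch, not the paper's controller.
    """
    e = np.asarray(e, dtype=float)
    J = np.asarray(J, dtype=float)
    g = np.asarray(g, dtype=float)

    a = J.T @ e          # J^T e, direction along which ||e|| changes
    denom = a @ a        # e^T J J^T e
    if denom < eps:
        # The operator is undefined here (e.g. at the target); a real
        # controller would switch laws. Placeholder: command zero velocity.
        return np.zeros(J.shape[1])

    # Primary task: d||e||/dt = -lam * ||e||
    primary = -lam * (e @ e) * a / denom
    # Large projector: removes only the single direction that affects ||e||
    P = np.eye(J.shape[1]) - np.outer(a, a) / denom
    return primary + P @ g

# Usage with random placeholder data (8 features, 6 velocity components)
rng = np.random.default_rng(0)
J = rng.standard_normal((8, 6))
e = rng.standard_normal(8)
g = rng.standard_normal(6)
v = large_projection_control(e, J, g, lam=0.5)
```

Because the projector removes a single direction, the secondary velocity keeps the remaining degrees of freedom, which is what leaves room for the learned motion while the error norm still decreases.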

Paper Structure

This paper contains 18 sections, 12 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Our work combines off-the-shelf deep learning strategies, which detect objects in clutter, with imitation learning, which realizes complicated trajectories such as dropping a cube into a cup on an untidy table. The large projection formulation combines the two machine learning components and ensures convergence to a given target.
  • Figure 2: State-of-the-art DL-based systems like YOLO can robustly detect the features of an object of interest in a raw monocular image. Examples of features are the vertices (denoted with 'C1', 'C2', 'C3', and 'C4') of the bounding box detected around the image of a cup. However, such detection systems fail to capture the correct object orientation. In the three snapshots, YOLO provides very similar feature values for three very different relative camera-object orientations, which produce a side (a), oblique (b), and top view (c) of the cup.
  • Figure 3: The proposed ILDVS framework uses a detection model (a frozen DL network, implemented by YOLO) to robustly extract features from raw images, and IL (implemented as a fine-tuned NODE network) to realize complex trajectories and overcome the limitations of the detection model. The large projection formulation merges the outputs of the detection and imitation strategies in a closed-loop control law, resulting in accurate and converging robot movements.
  • Figure 4: Initial (left) and final images (right) captured by the robot camera in the experiments with the mouse (top) and the cup (bottom). Desired visual features are shown in blue and denoted with the letter "G", whereas the current visual features are the green letters "C".
  • Figure 5: Position trajectories of demonstrations provided for the "Centering the mouse in the image" (left) and "Dropping an object in the cup" (right) tasks. Diversity is introduced by starting from different initial poses and also through the differences between each kinesthetic demonstration.
  • ...and 9 more figures
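The bounding-box features described in the caption of Figure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the box is given as pixel coordinates `(x_min, y_min, x_max, y_max)`, the corner order chosen for C1 to C4 is arbitrary, and the helper names `box_to_corner_features` and `feature_error` are hypothetical rather than taken from the authors' code.

```python
def box_to_corner_features(x_min, y_min, x_max, y_max):
    """Flatten the four box vertices into [x1, y1, x2, y2, x3, y3, x4, y4].

    Assumed corner order: top-left, top-right, bottom-right, bottom-left.
    """
    corners = [
        (x_min, y_min),
        (x_max, y_min),
        (x_max, y_max),
        (x_min, y_max),
    ]
    return [coord for corner in corners for coord in corner]

def feature_error(current, desired):
    """Element-wise difference between current and desired features."""
    return [c - d for c, d in zip(current, desired)]

# Usage with made-up pixel values for the current ("C") and desired ("G") boxes
current = box_to_corner_features(100, 80, 220, 260)
desired = box_to_corner_features(260, 180, 380, 360)
error = feature_error(current, desired)
```

Two camera views that produce the same bounding box give the same feature vector and therefore the same error, which is the orientation ambiguity the caption points out.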