End-to-end Multi-Instance Robotic Reaching from Monocular Vision

Zheyu Zhuang; Xin Yu; Robert Mahony

TL;DR

End-to-end visuomotor control is challenging in scenes with multiple identical objects because of visual ambiguity. The authors propose a real-time fully-convolutional network (FCN) that takes a monocular RGB image and the manipulator's joint angles as input and densely predicts, for each image grid cell, a control $\widehat{u}$ and a regressed control-Lyapunov function (cLf) value $\widehat{\mathcal{V}}$. The control of the cell with the lowest $\widehat{\mathcal{V}}$ is selected to drive reaching and grasping. A symmetry-aware cLf on $SE(3)$ with a velocity controller guarantees Lyapunov decrease, while the grid-based architecture with CoordConv enables robust multi-instance handling and dynamic scene adaptation. Trained entirely in simulation with domain randomization, the network runs at up to approximately 160 fps and achieves a real-world grasp success rate of $\approx 92.8\%$ across categories, demonstrating strong sim-to-real transfer and real-time performance without pose-detection pipelines.

Abstract

Multi-instance scenes are especially challenging for end-to-end visuomotor (image-to-control) learning algorithms. "Pipeline" visual servo control algorithms use separate detection, selection and servo stages, allowing algorithms to focus on a single object instance during servo control. End-to-end systems do not have separate detection and selection stages and need to address the visual ambiguities introduced by the presence of an arbitrary number of visually identical or similar objects during servo control. However, end-to-end schemes avoid embedding errors from detection and selection stages in the servo control behaviour, are more dynamically robust to changing scenes, and are algorithmically simpler. In this paper, we present a real-time end-to-end visuomotor learning algorithm for multi-instance reaching. The proposed algorithm uses a monocular RGB image and the manipulator's joint angles as input to a light-weight fully-convolutional network (FCN) to generate control candidates. A key innovation of the proposed method is identifying the optimal control candidate by regressing a control-Lyapunov function (cLf) value. The multi-instance capability emerges naturally from the stability analysis associated with the cLf formulation. We demonstrate the proposed algorithm effectively reaching and grasping objects from different categories on a table-top amid other instances and distractors, using an over-the-shoulder monocular RGB camera. The network runs at up to approximately 160 fps during inference on one GTX 1080 Ti GPU.
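The candidate-selection step described above can be illustrated with a minimal sketch. It assumes the network output has already been unpacked into NumPy arrays. The function name `select_control`, the foreground mask, the 6-dimensional control vector and the `stop_threshold` value are all hypothetical choices made for illustration, not details taken from the paper.

```python
import numpy as np

def select_control(v_hat, u_hat, foreground, stop_threshold):
    """Return the control of the foreground cell with the lowest predicted cLf value.

    v_hat:      (H, W) predicted cLf values, one per image grid cell
    u_hat:      (H, W, D) predicted controls, one D-vector per grid cell
    foreground: (H, W) boolean mask of cells treated as foreground (assumed given)
    Returns (control, (row, col), done); control and cell are None if no cell is foreground.
    """
    if not foreground.any():
        return None, None, False
    # Background cells are excluded by giving them an infinite cLf value.
    masked = np.where(foreground, v_hat, np.inf)
    row, col = np.unravel_index(np.argmin(masked), masked.shape)
    # The trajectory is considered finished once the lowest value drops below the threshold.
    done = bool(masked[row, col] < stop_threshold)
    return u_hat[row, col], (int(row), int(col)), done

# Toy 2x2 grid with 6-dimensional controls (an assumed velocity command size).
v_hat = np.array([[0.9, 0.4], [0.2, 0.05]])
u_hat = np.arange(24, dtype=float).reshape(2, 2, 6)
foreground = np.array([[True, True], [True, False]])
control, cell, done = select_control(v_hat, u_hat, foreground, stop_threshold=0.1)
```

In this toy grid the cell with the smallest raw value is masked out as background, so the selection falls on the lowest-valued foreground cell. Re-running the selection on every new frame is what lets the chosen target change when the scene is rearranged.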

Paper Structure (14 sections, 10 equations, 4 figures, 1 table)


Introduction
Formulation
Symmetry-aware Control Lyapunov Function
Velocity Controller Design
Learning the Control Lyapunov Function
Network Architecture
Loss Functions
Non-optimal Suppression
Implementation
Data Collection
Network Training Details
Grasping Experiments
Ablation Study
Conclusion

Figures (4)

Figure 1: Architecture of the proposed closed-loop reaching algorithm. A fully-convolutional network densely predicts a control Lyapunov function (cLf) value $\widehat{\mathcal{V}}$ and a control $\widehat{u}$ associated with each foreground image grid cell. Non-optimal suppression is achieved by selecting the control associated with the grid cell that has the lowest cLf value. The control is updated in real time as the image and joint angles are updated. The reaching trajectory terminates when the regressed Lyapunov value is lower than a threshold.
Figure 2: (a) The proposed network architecture. (b) Visualisation of the dynamic robustness of the reaching performance. The robot executes a real-time reaching trajectory, but it is stopped every 3 s and the scene is rearranged. The Lyapunov value of each image grid cell is colour-coded as shown by the colour bar. Initially, target instance 1 has the lowest Lyapunov value and the reaching trajectory focuses on this instance. The control is unchanged by the addition of another target instance 3. After target 1 is removed, the reaching trajectory refocuses on target instance 2. The introduction of extra instances or distractors has no impact on the successful grasp achieved at time $t = 9$ s.
Figure 3: Lab and simulation environments. The first-person camera is positioned as shown in the lab-setup panel (marked with the white circle), pointing towards the table workspace. The simulated environment is geometrically identical to the physical layout in the lab. The simulated camera is calibrated to match the real camera.
Figure 4: Visualisation of 32 reaching trajectories for two networks trained with and without CoordConv for the "multi-spam" dataset. For a fair comparison, experiments with each network share the same pre-sampled random initial end-effector poses. The test scene is static and contains one Spam Can and distractors. The vertical axis indicates the control regression loss $\mathcal{L}_\text{ctrl}$ defined in Eq. \ref{eq:loss_reg}.
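For readers unfamiliar with CoordConv, the general idea is to append coordinate channels to a feature map so that subsequent convolutions can condition on image position. The sketch below shows that idea in NumPy. The function name `add_coord_channels` and the $[-1, 1]$ normalisation are assumptions for illustration; the summary above does not say where or how the paper's network applies it.

```python
import numpy as np

def add_coord_channels(features):
    """Append normalised row and column coordinate channels to a (C, H, W) feature map."""
    _, h, w = features.shape
    # Coordinates are scaled to [-1, 1]; this range is an assumed convention.
    ys = np.linspace(-1.0, 1.0, h).astype(features.dtype)
    xs = np.linspace(-1.0, 1.0, w).astype(features.dtype)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")  # both have shape (H, W)
    return np.concatenate([features, yy[None], xx[None]], axis=0)

# A 3-channel 4x5 feature map gains two extra channels.
features = np.zeros((3, 4, 5), dtype=np.float32)
augmented = add_coord_channels(features)
```

A convolution applied to `augmented` sees the same appearance features as before plus each cell's position, which is one plausible reason such channels could help separate visually identical instances at different grid locations.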
