Sim2Real View Invariant Visual Servoing by Recurrent Control

Fereshteh Sadeghi, Alexander Toshev, Eric Jang, Sergey Levine

TL;DR

This work addresses visual servoing under large viewpoint variation by learning a view-invariant controller that uses memory to implicitly calibrate how actions influence image-space motion. A recurrent convolutional network is trained in randomized simulation with supervised demonstrations and a value-function objective, then transferred to real hardware by adapting only the visual features. The main contributions include a memory-enabled end-to-end policy, a dual training objective (supervised plus reinforcement learning), and a practical sim-to-real transfer protocol that achieves servoing to unseen objects from novel viewpoints on a Kuka IIWA with modest real-world data. The results show that recurrence and value prediction improve performance over reactive baselines and that lightweight visual adaptation enables effective real-world deployment.

Abstract

Humans are remarkably proficient at controlling their limbs and tools from a wide range of viewpoints and angles, even in the presence of optical distortions. In robotics, this ability is referred to as visual servoing: moving a tool or end-point to a desired location using primarily visual feedback. In this paper, we study how viewpoint-invariant visual servoing skills can be learned automatically in a robotic manipulation scenario. To this end, we train a deep recurrent controller that can automatically determine which actions move the end-point of a robotic arm to a desired object. The problem that must be solved by this controller is fundamentally ambiguous: under severe variation in viewpoint, it may be impossible to determine the actions in a single feedforward operation. Instead, our visual servoing system must use its memory of past movements to understand how the actions affect the robot motion from the current viewpoint, correcting mistakes and gradually moving closer to the target. This ability is in stark contrast to most visual servoing methods, which either assume known dynamics or require a calibration phase. We show how we can learn this recurrent controller using simulated data and a reinforcement learning objective. We then describe how the resulting model can be transferred to a real-world robot by disentangling perception from control and only adapting the visual layers. The adapted model can servo to previously unseen objects from novel viewpoints on a real-world Kuka IIWA robotic arm. For supplementary videos, see: https://fsadeghi.github.io/Sim2RealViewInvariantServo
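The ambiguity described in the abstract can be illustrated with a toy example. This is a hand-written sketch, not the learned recurrent controller from the paper: it assumes a hypothetical 1-D world in which the viewpoint decides whether an action moves the point in the positive or negative image direction. The function `servo_1d` and its parameters (`sign`, `gain`, `use_memory`) are invented for illustration. A controller that remembers its last action and the motion it produced can correct a wrong initial guess, while a purely reactive controller with a fixed guess cannot.

```python
def servo_1d(target, start, sign, steps=20, gain=0.5, use_memory=True):
    """Servo a 1-D image-space point toward `target`.

    `sign` (+1 or -1) is the hidden, viewpoint-dependent mapping from
    actions to image-space motion. The controller never reads it directly;
    it is only used to simulate the environment (toy assumption).
    """
    pos = start
    sign_estimate = 1.0          # initial guess, possibly wrong
    prev_action = None
    prev_pos = None
    for _ in range(steps):
        # Memory: compare the previous action with the motion it caused.
        if use_memory and prev_action is not None and prev_action != 0.0:
            observed = pos - prev_pos
            if observed != 0.0:
                sign_estimate = 1.0 if observed * prev_action > 0 else -1.0
        error = target - pos
        action = gain * sign_estimate * error
        prev_pos, prev_action = pos, action
        pos = pos + sign * action  # environment step with the hidden sign
    return pos


# Hidden sign is -1, so the initial guess (+1) is wrong.
with_memory = servo_1d(target=1.0, start=0.0, sign=-1)
reactive = servo_1d(target=1.0, start=0.0, sign=-1, use_memory=False)
print(abs(with_memory - 1.0) < 1e-3)   # memory recovers after one bad step
print(abs(reactive - 1.0) > 1.0)       # fixed guess moves away from target
```

In this toy setting the first step with memory moves the wrong way, and every later step halves the remaining error. The paper's controller is instead trained to perform this kind of calibration implicitly through its recurrent state.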

Paper Structure

This paper contains 14 sections, 3 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Illustration of our learned recurrent visual servoing controller. Training is performed in simulation (top) to reach varied objects from various viewpoints. The recurrent controller learns to implicitly calibrate the image-space motion of the arm with respect to the actions, which are issued in the unknown coordinate frame of the robot. The model is then transferred to the real world by adapting the visual features, and can reach previously unseen objects from novel viewpoints (bottom). Depending on viewpoint, the same actions can move the arm in opposite directions, requiring the model to maintain a memory of past motions to do self-calibration and complete the task.
  • Figure 2: Network Architecture: The input to the network consists of a query image (top-left) and the observed image at step $t$ (left). The images are processed by separate convolutional stacks, and their features are concatenated. The concatenated feature vector is fed into an LSTM layer, which outputs the policy: an end-effector movement command in Cartesian space, in the frame of the robot (bottom right). The previously selected action is also provided to the LSTM (bottom), enabling it to implicitly calibrate the effects of actions on image-space motion. Value prediction: a separate head (top right) predicts the Q-value of the action $a_t$ and is trained with Monte Carlo return estimates. Auxiliary loss: an auxiliary loss function minimizes the localization error for the query object in the observed image; it is also used to adapt the convolutional layers (left) with a small number of labeled real-world images.
  • Figure 3: We use randomized simulated scenes, as well as randomization of viewpoints, in order to train a recurrent controller in simulation for viewpoint invariant visual servoing.
  • Figure 4: The set of seen and unseen objects used in the real-world experiments. The seen plush toys are used for adapting the visual layers to natural images, while the unseen objects are used for testing.
  • Figure 5: Comparing recurrent control vs. reactive control in test scenarios with different levels of difficulty. Top row: test scenarios with three random objects. Bottom row: test scenarios with two random objects.
  • ...and 5 more figures
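
The Figure 2 caption states that the value head is trained with Monte Carlo return estimates. As a minimal sketch of what such a regression target could look like, the function below computes discounted returns from a sequence of per-step rewards in a single backward pass. The reward values, the discount factor `gamma`, and the function name `monte_carlo_returns` are assumptions made for illustration; the summary above does not specify how the paper defines rewards or discounting.

```python
def monte_carlo_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each t.

    `rewards` is the list of per-step rewards from one finished episode.
    `gamma` is a hypothetical discount factor, not a value from the paper.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy episode: reward only at the final step (e.g. the target was reached).
print(monte_carlo_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

Under these assumptions, each entry of the returned list would serve as the target for the Q-value predicted at the corresponding step of the episode.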