Self-supervised Learning of Image Embedding for Continuous Control

Carlos Florensa, Jonas Degrave, Nicolas Heess, Jost Tobias Springenberg, Martin Riedmiller

TL;DR

Problem: learn goal-directed control directly from high-dimensional visual input without external rewards. Approach: recast goal-reaching as minimum-time-to-observation RL, train with goal relabeling, and enforce a structured Q-function that ties dynamics to an embedding space (bridging model-free and model-based methods). Key contributions: (i) a novel Q-structure Q(o_t,a_t,o_g)=γ^{||f(φ(o_t),a_t)-φ(o_g)||}, (ii) a practical off-policy learning loop with Retrace and MPO, and (iii) empirical evidence on three MuJoCo tasks showing self-supervised visual goal-reaching and dynamic-embedding behavior. Significance: demonstrates robust, reward-free visual control and provides a foundation for hierarchy and improved exploration in downstream tasks.
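The structured Q-function in contribution (i) can be sketched as follows. This is only an illustration under stated assumptions: `phi` and `f` are hypothetical hand-written stand-ins for the learned embedding and latent dynamics, the norm is taken to be Euclidean, and the value of γ is arbitrary. None of these choices come from the text above.

```python
import math

GAMMA = 0.9  # discount factor; the value is an assumption for illustration


def phi(obs):
    """Hypothetical embedding. A learned image encoder would go here;
    this stand-in treats the observation as an already low-dimensional vector."""
    return list(obs)


def f(z, action):
    """Hypothetical latent dynamics: shifts the embedding by the action."""
    return [zi + ai for zi, ai in zip(z, action)]


def structured_q(o_t, a_t, o_g, gamma=GAMMA):
    """Q(o_t, a_t, o_g) = gamma ** ||f(phi(o_t), a_t) - phi(o_g)||.

    The Euclidean norm is an assumption; the summary only writes ||.||.
    """
    z_next = f(phi(o_t), a_t)
    z_goal = phi(o_g)
    dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(z_next, z_goal)))
    return gamma ** dist


def greedy_action(o_t, o_g, candidate_actions):
    """Pick the candidate whose predicted next embedding is closest to the goal,
    which is the candidate with the highest structured Q-value."""
    return max(candidate_actions, key=lambda a: structured_q(o_t, a, o_g))
```

Because γ lies in (0, 1), the Q-value increases as the predicted next embedding approaches the goal embedding, so maximizing Q over actions amounts to a one-step lookahead in embedding space. This is one way to read the claimed link between model-free and model-based methods.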

Abstract

Operating directly from raw, high-dimensional sensory inputs such as images remains a challenge for robotic control. Recently, reinforcement learning methods have been proposed to solve specific tasks end-to-end, from pixels to torques. However, these approaches assume access to a specified reward, which may require specialized instrumentation of the environment. Furthermore, the resulting policies and representations tend to be task-specific and may not transfer well. In this work, we investigate completely self-supervised learning of a general image embedding and control primitives, based on finding the shortest time to reach any state. We also introduce a new structure for the state-action value function that builds a connection between model-free and model-based methods and improves the performance of the learning algorithm. We experimentally demonstrate these findings in three simulated robotic tasks.
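The shortest-time objective can be illustrated with a hindsight-style relabeling of a recorded trajectory. The sketch below is an assumption-laden example, not the authors' implementation: the function names, the tuple layout, and the reward convention (1 on the transition that reaches the relabeled goal, 0 otherwise, with termination at the goal) are all chosen here for illustration.

```python
def relabel(observations, actions, goal_index):
    """Relabel a trajectory so that observations[goal_index] becomes the goal.

    Returns (o_t, a_t, o_next, o_goal, reward, done) tuples for every step
    before the goal. The reward convention is an assumption: 1.0 on the
    transition that lands on the goal observation, 0.0 otherwise.
    """
    if not 0 < goal_index < len(observations):
        raise ValueError("goal_index must point at a later observation")
    goal = observations[goal_index]
    transitions = []
    for t in range(goal_index):
        reached = (t + 1 == goal_index)
        transitions.append(
            (observations[t], actions[t], observations[t + 1],
             goal, 1.0 if reached else 0.0, reached)
        )
    return transitions


def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i], accumulated backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

Under this convention, the discounted return from a step that is k transitions away from the goal equals γ^(k-1), so a higher return corresponds to a shorter time to reach the goal observation, and no external reward signal is needed.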

Paper Structure

This paper contains 21 sections, 11 equations, 4 figures, 1 algorithm.

Figures (4)

  • Figure 1: Two Q-function architectures we compare to learn a visual goal-reaching policy.
  • Figure 2: Task observation, at the resolution given to the agent. No other proprioceptive or geometric information is used. The goal is also specified as an observation like the above.
  • Figure 3: Learning curves for the three environments plotting final L1 goal distance in position-space against collected environment steps.
  • Figure 4: Analysis of some distances along a trajectory.