PVEs: Position-Velocity Encoders for Unsupervised Learning of Structured State Representations

Rico Jonschkowski, Roland Hafner, Jonathan Scholz, Martin Riedmiller

TL;DR

PVEs learn structured state representations without supervision by splitting the latent space into position and velocity components, with velocity computed from finite differences of position. The method trains encoders with robotic priors (variation, slowness, inertia, conservation, and controllability) rather than with decoders or reconstruction, yielding low-dimensional, task-relevant representations from pixel observations. Across three MuJoCo tasks, PVEs recover meaningful topologies, produce consistent representations across viewpoints, and support reinforcement learning with improved or comparable control performance. Ball in cup remains challenging because of rapid dynamics and noisy velocity estimates, which points future work toward richer priors and tighter integration with RL.

Abstract

We propose position-velocity encoders (PVEs), which learn, without supervision, to encode images to positions and velocities of task-relevant objects. PVEs encode a single image into a low-dimensional position state and compute the velocity state from finite differences in position. In contrast to autoencoders, position-velocity encoders are not trained by image reconstruction, but by making the position-velocity representation consistent with priors about interacting with the physical world. We applied PVEs to several simulated control tasks from pixels and achieved promising preliminary results.
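
The velocity computation described in the abstract can be illustrated with a minimal sketch. It assumes the position states have already been produced by some encoder and are given as plain lists of floats. The function name `finite_difference_velocities`, the time step `dt`, and the example values are hypothetical and not taken from the paper, which may scale or smooth the differences differently.

```python
from typing import List, Sequence


def finite_difference_velocities(
    positions: Sequence[Sequence[float]], dt: float = 1.0
) -> List[List[float]]:
    """Estimate velocities as differences of consecutive position states.

    positions: one position state per time step (hypothetical encoder outputs).
    dt: assumed time between observations; not specified in the source.
    Returns one velocity per pair of consecutive steps, so the result has
    len(positions) - 1 entries.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    velocities = []
    for prev, curr in zip(positions, positions[1:]):
        # Per-dimension difference between the current and previous position.
        velocities.append([(c - p) / dt for p, c in zip(prev, curr)])
    return velocities


# Hypothetical 2-D position states for three consecutive observations.
positions = [[0.0, 1.0], [0.5, 1.0], [1.5, 0.0]]
velocities = finite_difference_velocities(positions, dt=0.5)
print(velocities)
```

Because each velocity is derived from two position states, any noise in the encoded positions is amplified by the division by `dt`, which is consistent with the noisy velocity estimates noted for fast-moving tasks.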

Paper Structure

This paper contains 29 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: PVEs encode an observation into a low-dimensional position state. From a sequence of such position states, they estimate velocities. PVEs learn the encoding by optimizing consistency of positions and velocities with robotic priors.
  • Figure 2: Three control tasks from pixel input
  • Figure 3: For the inverted pendulum, PVEs learn a circular position representation that allows accurate velocity estimation. Each dot in (a) and (b) is the encoding of a single observation. The color denotes the reward received with the observation (red = high, blue = low). Black dots in (b) show the encoding of the observation sequence in (c). Black lines show the estimated velocities. Supplementary videos: http://youtu.be/ipGe7Lph0Lw shows the learning process, http://youtu.be/u0bQwz89h1I demonstrates the learned PVE.
  • Figure 4: For cart-pole, PVEs learn equivalent state representations from different observations. Supplementary videos: learning process for the moving camera http://youtu.be/RKlciWWuJfc and static camera http://youtu.be/MYxrA1Bw6MU, learned PVE with the moving camera http://youtu.be/67QZRsLNTAE.
  • Figure 5: Learned position-velocity representation for ball in cup. Supplementary videos: http://youtu.be/3fLaSL8d4TY shows the learning process, http://youtu.be/lIhEGv5kLFo demonstrates the learned PVE.
  • ...and 1 more figure