Unsupervised Learning of Object Keypoints for Perception and Control

Tejas Kulkarni, Ankush Gupta, Catalin Ionescu, Sebastian Borgeaud, Malcolm Reynolds, Andrew Zisserman, Volodymyr Mnih

TL;DR

This work introduces Transporter, an unsupervised architecture that learns object keypoints by transporting image features between video frames through a keypoint bottleneck. The keypoints give a compact geometric representation that tracks objects and object parts over long time horizons more accurately than recent comparable methods. On Atari, using keypoint coordinates and their associated features as inputs enables sample-efficient reinforcement learning, and learning to control keypoint locations enables deep exploration without extrinsic rewards. The approach offers a reusable object-centric representation that could be extended to richer perceptual settings and dynamics modeling.

Abstract

The study of object representations in computer vision has primarily focused on developing representations that are useful for image classification, object detection, or semantic segmentation as downstream tasks. In this work we aim to learn object representations that are useful for control and reinforcement learning (RL). To this end, we introduce Transporter, a neural network architecture for discovering concise geometric object representations in terms of keypoints or image-space coordinates. Our method learns from raw video frames in a fully unsupervised manner, by transporting learnt image features between video frames using a keypoint bottleneck. The discovered keypoints track objects and object parts across long time-horizons more accurately than recent similar methods. Furthermore, consistent long-term tracking enables two notable results in control domains -- (1) using the keypoint co-ordinates and corresponding image features as inputs enables highly sample-efficient reinforcement learning; (2) learning to explore by controlling keypoint locations drastically reduces the search space, enabling deep exploration (leading to states unreachable through random action exploration) without any extrinsic rewards.

Paper Structure

This paper contains 23 sections, 2 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Transporter. Our model leverages object motion to discover keypoints by learning to transform a source video frame ($\bm{x}_s$) into a target frame ($\bm{x}_t$) by transporting image features at the discovered object locations. During training, spatial feature maps $\Phi(\bm{x})$ and keypoint co-ordinates $\Psi(\bm{x})$ are predicted for both frames using a ConvNet and the fully-differentiable KeyNet [Jakab18], respectively. The keypoint co-ordinates are transformed into Gaussian heatmaps $\mathcal{H}_{\Psi(\bm{x})}$ with the same spatial dimensions as the feature maps. The transport phase performs two operations: (1) the features of the source frame are set to zero at both sets of locations, $\mathcal{H}_{\Psi(\bm{x}_s)}$ and $\mathcal{H}_{\Psi(\bm{x}_t)}$; (2) the features in the source image $\Phi(\bm{x}_s)$ at the target positions $\Psi(\bm{x}_t)$ are replaced with the features from the target image, $\mathcal{H}_{\Psi(\bm{x}_t)} \cdot \Phi(\bm{x}_t)$. The final refinement ConvNet (which maps the transported feature map to an image) then has two tasks: (i) to inpaint the missing features at the source positions; and (ii) to clean up the image around the target positions. During inference, keypoints can be extracted for a single frame via a feed-forward pass through the KeyNet ($\Psi$).
  • Figure 2: Keypoint visualisation. Visualisations from our method and from state-of-the-art unsupervised object keypoint discovery methods, Jakab et al. [Jakab18] and Zhang et al. [zhang2018unsupervised], on the Atari ALE [bellemare2013arcade] and Manipulator [tassa2018deepmind] domains. Our method learns more spatially aligned keypoints, e.g. on frostbite and stack_4 (see Section \ref{s:kpt-eval}). Quantitative evaluations are given in Figure \ref{f:quantitative} and further visualisations in the supplementary material.
  • Figure 3: Temporal consistency of keypoints. Our learned keypoints are temporally consistent across hundreds of environment steps, as demonstrated on the classic hard-exploration game Montezuma's Revenge [bellemare2013arcade]. We also predict the most controllable keypoint, denoted by the triangular markers, without using any environment rewards. This prediction often corresponds to the avatar in the game, and it is consistently tracked across different parts of the state space. See Section \ref{s:efficient} for further discussion.
  • Figure 4: Long-term tracking evaluation. We compare the long-term tracking ability of our keypoint detector against Jakab et al. [Jakab18] and Zhang et al. [zhang2018unsupervised] (visualisations in Figure \ref{f:qualitative} and the supplementary material). We report precision and recall for trajectories of varying lengths (lengths $= 1$ -- $200$ frames; each frame corresponds to 4 action repeats) against ground-truth keypoints on the Atari ALE [bellemare2013arcade] and Manipulator [tassa2018deepmind] domains. Our method significantly outperforms the baselines on all games ($100\%$ on pong), except for ms_pacman, where we perform similarly, especially for long trajectories (length $= 200$). See Section \ref{s:kpt-eval} for further discussion.
  • Figure 5: Agent architecture for data-efficient reinforcement learning. Transporter is trained off-line with data collected using a random policy. A recurrent variant of the neural-fitted Q-learning algorithm [riedmiller2005neural] rapidly learns control policies from keypoint co-ordinates and the features at the corresponding locations, given game rewards.
  • ...and 11 more figures
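
The two transport operations described in the Figure 1 caption can be read as a single masked blend, $\hat{\Phi}(\bm{x}_s, \bm{x}_t) = (1 - \mathcal{H}_{\Psi(\bm{x}_s)}) \cdot (1 - \mathcal{H}_{\Psi(\bm{x}_t)}) \cdot \Phi(\bm{x}_s) + \mathcal{H}_{\Psi(\bm{x}_t)} \cdot \Phi(\bm{x}_t)$. The NumPy sketch below illustrates that reading only and is not the authors' implementation. The function names `gaussian_heatmaps` and `transport`, the $(y, x)$ co-ordinate convention in $[-1, 1]$, the value of `sigma`, and the choice to merge the per-keypoint heatmaps with a maximum are all assumptions made for this example. Random arrays stand in for the ConvNet features $\Phi$, and the keypoints are fixed by hand rather than predicted by KeyNet.

```python
import numpy as np


def gaussian_heatmaps(keypoints, height, width, sigma=0.1):
    """Render K keypoints as Gaussian heatmaps of shape (K, height, width).

    keypoints: (K, 2) array of (y, x) in [-1, 1] (assumed convention).
    sigma: Gaussian width in the same normalised units (illustrative value).
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)
    ys = np.linspace(-1.0, 1.0, height)[None, :, None]  # (1, H, 1)
    xs = np.linspace(-1.0, 1.0, width)[None, None, :]   # (1, 1, W)
    ky = keypoints[:, 0][:, None, None]                 # (K, 1, 1)
    kx = keypoints[:, 1][:, None, None]                 # (K, 1, 1)
    sq_dist = (ys - ky) ** 2 + (xs - kx) ** 2           # (K, H, W)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def transport(feat_s, feat_t, heat_s, heat_t):
    """Blend source and target features at the keypoint locations.

    feat_s, feat_t: feature maps of shape (H, W, C).
    heat_s, heat_t: heatmaps of shape (K, H, W).
    """
    # Assumption: merge the K heatmaps with a max so each mask stays in [0, 1].
    mask_s = heat_s.max(axis=0)[..., None]  # (H, W, 1)
    mask_t = heat_t.max(axis=0)[..., None]  # (H, W, 1)
    # (1) zero the source features at both source and target keypoints.
    suppressed = (1.0 - mask_s) * (1.0 - mask_t) * feat_s
    # (2) paste in the target features at the target keypoints.
    return suppressed + mask_t * feat_t


# Toy example: one keypoint that moves from the upper left to the lower right.
rng = np.random.default_rng(0)
feat_s = rng.normal(size=(9, 9, 4))  # stand-in for Phi(x_s)
feat_t = rng.normal(size=(9, 9, 4))  # stand-in for Phi(x_t)
heat_s = gaussian_heatmaps([[-0.5, -0.5]], 9, 9)
heat_t = gaussian_heatmaps([[0.5, 0.5]], 9, 9)
out = transport(feat_s, feat_t, heat_s, heat_t)
```

In this toy setting the transported map matches the target features at the target keypoint, is close to zero at the source keypoint (the region the refinement network would have to inpaint), and is left essentially unchanged far from both keypoints.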