Control-oriented Clustering of Visual Latent Representation

Han Qi, Haocheng Yin, Heng Yang

TL;DR

Surprisingly, an NC-pretrained vision encoder, when finetuned end-to-end with the action decoder, boosts the test-time performance by 10% to 35%.

Abstract

We initiate a study of the geometry of the visual representation space -- the information channel from the vision encoder to the action decoder -- in an image-based control pipeline learned from behavior cloning. Inspired by the phenomenon of neural collapse (NC) in image classification (arXiv:2008.08186), we empirically demonstrate the prevalent emergence of a similar law of clustering in the visual representation space. Specifically, in discrete image-based control (e.g., Lunar Lander), the visual representations cluster according to the natural discrete action labels; in continuous image-based control (e.g., Planar Pushing and Block Stacking), the clustering emerges according to "control-oriented" classes that are based on (a) the relative pose between the object and the target in the input or (b) the relative pose of the object induced by expert actions in the output. Each of the classes corresponds to one relative pose orthant (REPO). Beyond empirical observation, we show that such a law of clustering can be leveraged as an algorithmic tool to improve test-time performance when training a policy with limited expert demonstrations. In particular, we pretrain the vision encoder using NC as a regularizer to encourage control-oriented clustering of the visual features. Surprisingly, such an NC-pretrained vision encoder, when finetuned end-to-end with the action decoder, boosts the test-time performance by 10% to 35%. Real-world vision-based planar pushing experiments confirm the surprising advantage of control-oriented visual representation pretraining.
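To make the notion of a relative pose orthant concrete, here is a minimal sketch of one way such labels could be assigned. It assumes a planar relative pose (dx, dy, dtheta) with the angle already wrapped to (-pi, pi], and it assumes the orthant is determined by the sign of each component, which gives 2^3 = 8 classes. The function name `repo_labels` and the bit encoding are hypothetical illustrations; the paper's exact class definition may differ.

```python
import numpy as np

def repo_labels(rel_pose):
    """Map relative poses of shape (N, D) to orthant ids in [0, 2**D).

    Assumption: a component >= 0 sets the corresponding bit, so each
    sign pattern of the D pose components gets its own class id.
    """
    rel_pose = np.atleast_2d(np.asarray(rel_pose, dtype=float))
    bits = (rel_pose >= 0).astype(np.int64)   # 1 where the component is non-negative
    weights = 2 ** np.arange(rel_pose.shape[1])  # 1, 2, 4, ...
    return bits @ weights

# Three illustrative (made-up) relative poses: (dx, dy, dtheta).
poses = np.array([
    [0.3, -0.1, 0.5],
    [-0.2, -0.4, -1.0],
    [0.0, 0.2, -0.3],
])
labels = repo_labels(poses)
print(labels)
```

Labels of this kind could then stand in for the discrete action labels when checking for clustering in a continuous control task.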

Paper Structure (57 sections, 10 equations, 31 figures, 3 tables)

This paper contains 57 sections, 10 equations, 31 figures, 3 tables.

Figures (31)

  • Figure 1: Per-class (red, blue, green) globally centered features (points) and mean vectors (black lines with $\star$ endpoints) from the penultimate latent space when cloning the optimal bang-bang policy \eqref{eq:optimal-bang-bang} from expert demonstrations. Numbers in the blue band represent the lengths of the three per-class mean vectors, and numbers in the red band represent the angles spanned by pairs of per-class mean vectors. The lengths and angles tend to be equal to each other as training progresses.
  • Figure 2: Investigation of a law of clustering, similar to NC, in the visual representation space.
  • Figure 3: Emergence of neural collapse in the visual representation space for Lunar Lander. From left to right, it shows three NC metrics w.r.t. training epochs using ResNet18 as the vision encoder and an MLP as the action decoder, with four discrete actions as classification labels.
  • Figure 4: Control-oriented classification for continuous vision-based control tasks. Left: planar pushing; Right: block stacking.
  • Figure 5: Test scores w.r.t. training epochs for four different instantiations of the image-based control pipeline for planar pushing. In (a) and (b), we show test scores for three random seeds. In (c) and (d), we show test scores for a single seed because using an LSTM as the action decoder leads to poor test-time performance, an observation that is consistent with chi2023diffusion.
  • ...and 26 more figures
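
The quantities in the Figure 1 caption (lengths of globally centered per-class mean vectors and the angles between pairs of them) can be computed as in the sketch below. This is an illustrative NumPy implementation, not the authors' code; the function name `class_mean_geometry` and the toy data are assumptions. For three classes, equal lengths and equal pairwise angles correspond to means 120 degrees apart, which is the configuration the toy example is built to reproduce.

```python
import numpy as np

def class_mean_geometry(features, labels):
    """Return (lengths, pairwise angles in degrees) of globally centered class means."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    global_mean = features.mean(axis=0)
    classes = np.unique(labels)
    # Per-class mean vectors, measured from the global mean.
    means = np.stack(
        [features[labels == c].mean(axis=0) - global_mean for c in classes]
    )
    lengths = np.linalg.norm(means, axis=1)
    unit = means / lengths[:, None]
    cosines = unit @ unit.T
    upper = np.triu_indices(len(classes), k=1)  # each unordered pair once
    angles = np.degrees(np.arccos(np.clip(cosines[upper], -1.0, 1.0)))
    return lengths, angles

# Toy data: three balanced classes whose means sit on an equilateral triangle.
thetas = np.radians([0.0, 120.0, 240.0])
centers = 2.0 * np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
offset = np.array([0.1, 0.0])
features = np.concatenate([centers + offset, centers - offset])
labels = np.array([0, 1, 2, 0, 1, 2])

lengths, angles = class_mean_geometry(features, labels)
print(lengths, angles)
```

Tracking the spread of `lengths` and `angles` over training epochs would give a rough numerical counterpart to the trend the caption describes.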

Theorems & Definitions (1)

  • Remark 1: Fine-grained REPOs