Latent Action Learning Requires Supervision in the Presence of Distractors

Alexander Nikulin, Ilya Zisman, Denis Tarasov, Nikita Lyubaykin, Andrei Polubarov, Igor Kiselev, Vladislav Kurenkov

TL;DR

The paper investigates latent action learning under action-correlated distractors and finds that standard LAPO struggles in such settings. It introduces LAOM, a modified LAPO architecture that significantly boosts latent-action quality and downstream performance, but still falls short of distractor-free baselines. Most notably, the authors show that incorporating supervision from a small fraction of ground-truth actions during latent-action learning yields large downstream gains, suggesting that supervising LAM training is essential in realistic, distractor-rich scenarios. This challenges the common pre-training pipeline that decouples latent-action learning from action decoding and points toward practical strategies for leveraging web-scale video data for embodied AI.

Abstract

Recently, latent action learning, pioneered by Latent Action Policies (LAPO), has shown remarkable pre-training efficiency on observation-only data, offering the potential to leverage the vast amounts of video available on the web for embodied AI. However, prior work has focused on distractor-free data, where changes between observations are primarily explained by ground-truth actions. Unfortunately, real-world videos contain action-correlated distractors that may hinder latent action learning. Using the Distracting Control Suite (DCS), we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggles in such scenarios. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, for as little as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Model (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning a LAM and only then decoding from latent to ground-truth actions.
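
The abstract measures latent-action quality by linear probing. As a rough illustration of the idea only, and not the authors' evaluation code, the sketch below fits a least-squares linear map from latent actions to ground-truth actions and reports held-out mean squared error. The function name `linear_probe_mse`, the synthetic data, and all dimensions are assumptions made for this example.

```python
import numpy as np


def linear_probe_mse(latents, actions, train_frac=0.8, seed=0):
    """Fit a linear map (with bias) from latent to true actions; return held-out MSE.

    Hypothetical helper for illustration; the paper's exact probing protocol may differ.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(latents))
    n_train = int(train_frac * len(latents))
    train, test = idx[:n_train], idx[n_train:]
    # Append a constant column so the probe can learn a bias term.
    features = np.hstack([latents, np.ones((len(latents), 1))])
    weights, *_ = np.linalg.lstsq(features[train], actions[train], rcond=None)
    predictions = features[test] @ weights
    return float(np.mean((predictions - actions[test]) ** 2))


# Synthetic stand-ins (assumed, not from the paper): one set of latents that
# linearly encodes the actions, and one that carries no action information.
rng = np.random.default_rng(1)
actions = rng.uniform(-1.0, 1.0, size=(2000, 6))
informative = actions @ rng.normal(size=(6, 32)) + 0.01 * rng.normal(size=(2000, 32))
uninformative = rng.normal(size=(2000, 32))

mse_good = linear_probe_mse(informative, actions)
mse_bad = linear_probe_mse(uninformative, actions)
print(f"probe MSE, informative latents:   {mse_good:.5f}")
print(f"probe MSE, uninformative latents: {mse_bad:.5f}")
```

A lower probe error means the latent actions carry more linearly decodable information about the true actions, which is the sense in which one latent action model can be said to produce higher-quality latents than another.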

Paper Structure

This paper contains 15 sections, 15 figures, 10 tables.

Figures (15)

  • Figure 1: We show that in the presence of distractors, LAPO struggles to learn latent actions useful for pre-training and that simple BC or IDM is more effective. We propose LAOM, a simple modification that doubles the performance but still underperforms. Thus, we propose to reuse available ground-truth action labels to supervise latent action learning, which significantly improves the performance, achieving a normalized score of 0.44. It recovers almost half the performance of BC with access to the full action-labeled dataset, while having access to only 2.5% of the labels. Results are averaged over four environments from the Distracting Control Suite, with three random seeds each. We provide per-environment plots in \ref{fig:final-res}. See \ref{exp:setup} for the evaluation protocol and \ref{exp:lapo-laom} and \ref{exp:laom-supervision} for method details.
  • Figure 2: Visualization of the environments from the Distracting Control Suite (DCS) used in our work. Top row: without any distractors, identical to the original DeepMind Control Suite. Bottom row: with distractors, which consist of dynamic background videos, agent color changes, and camera shaking. See \ref{exp:setup} for additional details.
  • Figure 3: Overview of the latent action learning pipeline. In the first stage, the Latent Action Model (LAM) is pre-trained to infer latent actions between consecutive observations. In the second stage, the LAM is used to relabel the entire dataset with latent actions, which are then used for behavioral cloning. Finally, a decoder is trained to map from latent to true actions using a small number of labelled trajectories. In our work, we do not modify this pipeline in any way; we only examine the LAM architecture itself (see \ref{fig:lapo-arc-viz}).
  • Figure 4: Simplified architecture visualization of LAPO and LAOM, our proposed modification. LAPO consists of an IDM and an FDM, both with separate encoders, uses latent action quantization, and predicts the next observation in image space via the decoder in the FDM. LAOM incorporates a multi-step IDM, removes quantization, and does not reconstruct images, relying instead on a latent temporal consistency loss. Images are encoded by a shared encoder, while the IDM and FDM operate in a compact latent space. When a small number of ground-truth action labels is available, we use them for supervision, linearly predicting them from latent actions. For a detailed description, see \ref{exp:lapo-laom}.
  • Figure 5: Quality of latent actions learned by LAPO. We show that quantization of latent actions significantly reduces the quality of actions, even on data without distractors, where LAPO should work without problems. Removing the quantization recovers the latent action quality, but additional modifications are needed to improve LAPO performance with distractors. Results are averaged across all four environments, each with three random seeds.
  • ...and 10 more figures
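
To make the Figure 4 description more concrete, the following is a minimal forward-pass sketch of a LAOM-style objective: a shared encoder, an IDM that infers a latent action from the current and a future embedding, an FDM that predicts the future embedding in latent space, and a linear head supervised on the small labeled subset. It is one possible reading of the caption and not the authors' implementation. All layer shapes, the single-layer linear/tanh modules, the `sup_weight` parameter, and the name `laom_style_loss` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes chosen only for this example.
OBS_DIM, EMB_DIM, LATENT_ACT_DIM, ACT_DIM = 64, 16, 8, 4

params = {
    "enc": rng.normal(scale=0.1, size=(OBS_DIM, EMB_DIM)),                   # shared encoder
    "idm": rng.normal(scale=0.1, size=(2 * EMB_DIM, LATENT_ACT_DIM)),        # inverse dynamics
    "fdm": rng.normal(scale=0.1, size=(EMB_DIM + LATENT_ACT_DIM, EMB_DIM)),  # forward dynamics
    "head": rng.normal(scale=0.1, size=(LATENT_ACT_DIM, ACT_DIM)),           # linear action head
}


def laom_style_loss(params, obs, future_obs, actions, labeled_mask, sup_weight=1.0):
    """Return (total, consistency, supervised) losses for one batch.

    `future_obs` stands for an observation some steps ahead (multi-step IDM).
    Forward pass only: in actual training, a latent consistency target would
    likely need a stop-gradient or target encoder, which is omitted here.
    """
    z = np.tanh(obs @ params["enc"])
    z_future = np.tanh(future_obs @ params["enc"])
    # IDM: continuous latent action from current and future embeddings (no quantization).
    latent_action = np.concatenate([z, z_future], axis=1) @ params["idm"]
    # FDM: predict the future embedding in latent space (no image reconstruction).
    z_pred = np.concatenate([z, latent_action], axis=1) @ params["fdm"]
    consistency = float(np.mean((z_pred - z_future) ** 2))
    if labeled_mask.any():
        # Supervision: linearly predict true actions from latent actions on labeled samples.
        predicted_actions = latent_action[labeled_mask] @ params["head"]
        supervised = float(np.mean((predicted_actions - actions[labeled_mask]) ** 2))
    else:
        supervised = 0.0
    return consistency + sup_weight * supervised, consistency, supervised


# Toy batch of 40 transitions with 1 labeled sample (2.5% of the batch).
obs = rng.normal(size=(40, OBS_DIM))
future_obs = rng.normal(size=(40, OBS_DIM))
actions = rng.uniform(-1.0, 1.0, size=(40, ACT_DIM))
labeled_mask = np.zeros(40, dtype=bool)
labeled_mask[0] = True

total, consistency, supervised = laom_style_loss(params, obs, future_obs, actions, labeled_mask)
unlabeled_total, _, unlabeled_sup = laom_style_loss(
    params, obs, future_obs, actions, np.zeros(40, dtype=bool)
)
print(f"total={total:.4f} consistency={consistency:.4f} supervised={supervised:.4f}")
```

The point of the sketch is that the supervised term is added to the latent action model's own objective, so action labels shape the latent actions during pre-training instead of being used only afterwards to train a decoder.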