Latent Action Learning Requires Supervision in the Presence of Distractors
Alexander Nikulin, Ilya Zisman, Denis Tarasov, Nikita Lyubaykin, Andrei Polubarov, Igor Kiselev, Vladislav Kurenkov
TL;DR
The paper investigates latent action learning under action-correlated distractors and finds that standard LAPO struggles in such settings. It introduces LAOM, a modified LAPO architecture that improves latent-action quality by 8x as measured by linear probing, but still falls short of distractor-free baselines. Most notably, the authors show that supervising latent-action learning with ground-truth actions for as little as 2.5% of the dataset improves downstream performance by 4.2x on average, suggesting that supervision during LAM training is essential in realistic, distractor-rich scenarios. This challenges the common pre-training pipeline that decouples latent-action learning from action decoding and points toward practical strategies for leveraging web-scale video data for embodied AI.
Abstract
Recently, latent action learning, pioneered by Latent Action Policies (LAPO), has shown remarkable pre-training efficiency on observation-only data, offering potential for leveraging the vast amounts of video available on the web for embodied AI. However, prior work has focused on distractor-free data, where changes between observations are primarily explained by ground-truth actions. Unfortunately, real-world videos contain action-correlated distractors that may hinder latent action learning. Using the Distracting Control Suite (DCS), we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggles in such scenarios. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Model (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning a LAM and only then decoding from latent to ground-truth actions.
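To make the linear-probing metric mentioned above concrete, the following is a minimal sketch of how such a probe could be computed. It is not the authors' evaluation code: the function name `linear_probe_mse`, the train/test split, and the synthetic data are all assumptions made for illustration. The idea is to fit an affine map from latent actions to ground-truth actions and report the held-out mean squared error, where a lower error indicates that the latents encode the true actions more directly.

```python
import numpy as np


def linear_probe_mse(latents, actions, train_frac=0.8, seed=0):
    """Fit an affine map latents -> actions and return held-out MSE.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = latents.shape[0]
    perm = rng.permutation(n)
    n_train = int(train_frac * n)
    train_idx, test_idx = perm[:n_train], perm[n_train:]

    # Append a constant column so the probe includes a bias term.
    X = np.hstack([latents, np.ones((n, 1))])

    # Ordinary least squares; lstsq also handles rank-deficient inputs.
    W, *_ = np.linalg.lstsq(X[train_idx], actions[train_idx], rcond=None)

    pred = X[test_idx] @ W
    return float(np.mean((pred - actions[test_idx]) ** 2))


# Synthetic stand-ins (assumed shapes: 1000 transitions, 4-dim actions,
# 16-dim latent actions).
rng = np.random.default_rng(0)
true_actions = rng.normal(size=(1000, 4))

# "Good" latents: a linear encoding of the true actions.
good_latents = true_actions @ rng.normal(size=(4, 16))

# "Bad" latents: noise unrelated to the actions, mimicking latents that
# capture distractors instead of control-relevant changes.
bad_latents = rng.normal(size=(1000, 16))

good_mse = linear_probe_mse(good_latents, true_actions)
bad_mse = linear_probe_mse(bad_latents, true_actions)
print(f"probe MSE, action-aligned latents: {good_mse:.3e}")
print(f"probe MSE, distractor-like latents: {bad_mse:.3e}")
```

In this toy setup the action-aligned latents yield a near-zero probe error, while the unrelated latents give an error close to the variance of the actions themselves, which is the kind of gap a probe-based quality measure is meant to expose.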
