Object-Centric Latent Action Learning
Albina Klepach, Alexander Nikulin, Ilya Zisman, Denis Tarasov, Alexander Derevyagin, Andrei Polubarov, Nikita Lyubaykin, Igor Kiselev, Vladislav Kurenkov
TL;DR
This work tackles the challenge of learning latent actions from unlabeled internet video in visually distracting environments. It introduces object-centric latent action learning, which combines VideoSAUR-based self-supervised object decomposition with LAPO-style latent action modeling and a Linear Action Probe for slot selection, followed by behavior cloning and supervised finetuning on a small number of action labels. Across eight tasks in the Distracting Control Suite (DCS) and Distracting MetaWorld (DMW), object-centric pretraining recovers about half of the performance lost to distractors relative to clean data, enabling robust imitation and efficient adaptation when action labels are scarce. Ablations show that slot-based representations are more robust than pixel-based ones, that slot relevance tracks downstream performance, and that results are robust to the number of slots. Limitations remain regarding memory, dynamic object counts, and reliance on object-centric models. The findings suggest a practical path toward scalable, robust imitation learning from large unlabeled video by using structured, object-centric representations as a strong inductive bias.
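To make the slot-selection idea concrete, here is a minimal sketch of how a linear probe could rank slots by how well they predict ground-truth actions. It is an illustration under stated assumptions, not the authors' implementation: the function names `linear_probe_mse` and `select_slot`, the ridge regularizer, the use of training-set error, and the synthetic data are all hypothetical, and the paper's probe may be fit on different features (for example, latent actions rather than raw slot features).

```python
import numpy as np


def linear_probe_mse(features, actions, ridge=1e-3):
    """Fit a ridge-regularized linear map features -> actions; return training MSE.

    features: (N, D) array, actions: (N, A) array.
    The ridge term is an assumption added for numerical stability.
    """
    # Append a constant column so the probe can learn a bias term.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    W = np.linalg.solve(gram, X.T @ actions)
    residual = X @ W - actions
    return float(np.mean(residual ** 2))


def select_slot(slot_features, actions):
    """Return (index of the slot with the lowest probe error, list of per-slot errors).

    slot_features: (N, K, D) array holding K slots per sample.
    """
    num_slots = slot_features.shape[1]
    errors = [
        linear_probe_mse(slot_features[:, k, :], actions) for k in range(num_slots)
    ]
    return int(np.argmin(errors)), errors


# Synthetic check: only slot 2 carries action information; the rest are noise.
rng = np.random.default_rng(0)
N, K, D, A_DIM = 512, 4, 8, 2
slots = rng.normal(size=(N, K, D))
true_W = rng.normal(size=(D, A_DIM))
actions = slots[:, 2, :] @ true_W + 0.01 * rng.normal(size=(N, A_DIM))

best, errors = select_slot(slots, actions)
print(best)  # index of the action-relevant slot
```

In this toy setup the probe error for the action-carrying slot is orders of magnitude lower than for the distractor slots, which is the signal a selection rule of this kind relies on. A held-out split would be the more careful choice when labeled data allows it.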
Abstract
Leveraging vast amounts of unlabeled internet video data for embodied AI is currently bottlenecked by the lack of action labels and the presence of action-correlated visual distractors. Although recent latent action policy optimization (LAPO) has shown promise in inferring proxy action labels from visual observations, its performance degrades significantly when distractors are present. To address this limitation, we propose a novel object-centric latent action learning framework that centers on objects rather than pixels. We leverage self-supervised object-centric pretraining to disentangle the movement of the agent from distracting background dynamics. This allows LAPO to focus on task-relevant interactions, yielding more robust proxy action labels, which in turn enable better imitation learning and efficient adaptation of the agent with just a few action-labeled trajectories. We evaluate our method on eight visually complex tasks across the Distracting Control Suite (DCS) and Distracting MetaWorld (DMW). Our results show that object-centric pretraining mitigates the negative effects of distractors by 50%, as measured by downstream task performance: average return (DCS) and success rate (DMW).
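One plausible way to read a figure like "mitigates the negative effects of distractors by 50%" is as the fraction of the clean-versus-distracted performance gap that a method recovers. The sketch below illustrates that reading only. The formula is an assumption rather than the paper's stated definition, the function name `gap_recovered` is hypothetical, and the scores are made-up numbers chosen for the arithmetic, not reported results.

```python
def gap_recovered(clean_score, distracted_score, method_score):
    """Fraction of the performance lost to distractors that a method wins back.

    clean_score:      baseline trained and evaluated without distractors
    distracted_score: same baseline with distractors present
    method_score:     proposed method with distractors present
    Scores can be average returns or success rates, as long as all three
    use the same metric.
    """
    gap = clean_score - distracted_score
    if gap <= 0:
        raise ValueError("distractors did not reduce performance; gap is undefined")
    return (method_score - distracted_score) / gap


# Hypothetical returns, for illustration only.
fraction = gap_recovered(clean_score=800.0, distracted_score=400.0, method_score=600.0)
print(fraction)  # 0.5
```

A value of 0 means the method does no better than the distracted baseline, and a value of 1 means it matches performance on clean data.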
