Object-Centric Latent Action Learning

Albina Klepach, Alexander Nikulin, Ilya Zisman, Denis Tarasov, Alexander Derevyagin, Andrei Polubarov, Nikita Lyubaykin, Igor Kiselev, Vladislav Kurenkov

TL;DR

This work tackles the challenge of learning latent actions from unlabeled internet video in visually distracting environments. It introduces object-centric latent action learning, which combines VideoSAUR-based self-supervised object decomposition with LAPO-style latent action modeling and a Linear Action Probe for slot selection, followed by behavior cloning and minimal supervised fine-tuning. Across eight tasks in DCS and DMW, object-centric pretraining closes about half of the performance gap between training on distracted data and training on clean data, enabling robust imitation and efficient adaptation with scarce action labels. Ablations show that slot-based representations are more robust than pixel-based ones, that probe-measured slot relevance tracks downstream performance, and that results are robust to the number of slots. Limitations remain regarding memory, dynamic object counts, and reliance on object-centric models. The findings suggest a practical path toward scalable, robust imitation learning from large unlabeled video by using structured, object-centric representations as a strong inductive bias.

Abstract

Leveraging vast amounts of unlabeled internet video data for embodied AI is currently bottlenecked by the lack of action labels and the presence of action-correlated visual distractors. Although recent latent action policy optimization (LAPO) has shown promise in inferring proxy action labels from visual observations, its performance degrades significantly when distractors are present. To address this limitation, we propose a novel object-centric latent action learning framework that centers on objects rather than pixels. We leverage self-supervised object-centric pretraining to disentangle the movement of the agent and distracting background dynamics. This allows LAPO to focus on task-relevant interactions, resulting in more robust proxy-action labels, enabling better imitation learning and efficient adaptation of the agent with just a few action-labeled trajectories. We evaluated our method in eight visually complex tasks across the Distracting Control Suite (DCS) and Distracting MetaWorld (DMW). Our results show that object-centric pretraining mitigates the negative effects of distractors by 50%, as measured by downstream task performance: average return (DCS) and success rate (DMW).
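
One way to read the "by 50%" figure is as the fraction of the gap between the distracted baseline and the clean-data upper bound that a method recovers. The sketch below illustrates that reading only. The function name `gap_closed` and the example scores are hypothetical and are not taken from the paper, whose exact aggregation across tasks may differ.

```python
def gap_closed(method: float, baseline: float, clean: float) -> float:
    """Fraction of the baseline-to-clean performance gap recovered by `method`.

    All three inputs are normalized downstream scores (e.g. average return
    or success rate divided by the score of a fully supervised BC agent).
    """
    gap = clean - baseline
    if gap <= 0:
        raise ValueError("clean score must exceed the distracted baseline")
    return (method - baseline) / gap


# Illustrative, made-up scores (NOT results from the paper).
example = gap_closed(method=0.5, baseline=0.25, clean=0.75)
print(example)
```

A value of 0 means the method is no better than LAPO trained on distracted videos, and a value of 1 means it matches LAPO trained on clean videos.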

Paper Structure

This paper contains 32 sections, 4 equations, 26 figures, 7 tables.

Figures (26)

  • Figure 1: Main Results. Our methods, LAPO-slots and LAPO-masks, leverage object-centric representations to significantly improve latent action learning under visual distractions. Compared to the baseline (LAPO (Schmidt and Jiang, 2024), trained on raw videos with distractions), our approaches reduce the performance gap toward the clean-data upper bound (LAPO-clean, trained on clean videos without distractions) by 50%. Distractions include camera movements, color variations, and dynamic background videos. Downstream performance is normalized to a behavior cloning agent trained on all ground-truth action labels. Results are averaged over three random seeds. See the experiments section for details.
  • Figure 2: General overview of our method's pre-training pipeline. Object-Centric Pretraining: We decompose video sequences into interpretable object slots. A linear probe trained on slot representations automatically selects task-relevant slots by identifying those most predictive of actions. Latent Action Learning: We train a latent action model based on LAPO, learning inverse and forward dynamics in the slot space. Behavior Cloning and Fine-tuning: A behavior cloning agent is trained on the inferred latent actions. The resulting policy is then fine-tuned using a small number of trajectories with ground-truth actions ($\leq 2.5\%$ of total data), enabling strong downstream performance with minimal supervision.
  • Figure 3: Visuals from the Distracting Control Suite. From top to bottom: cheetah-run, walker-run, hopper-hop, humanoid-walk. From left to right: the distracted observation (background video, color, and camera position variations), the non-distracted observation, the mixture of slot decoder masks obtained after object-centric pretraining, and the main object slot decoder mask selected after object-centric pretraining.
  • Figure 4: Normalized evaluation returns and success rates of the BC agent trained on latent actions for varying numbers of labeled fine-tuning trajectories. TL;DR: Object-centric learning improves evaluation returns on all DCS tasks and success rates on all goal-based DMW tasks. The plots are arranged from left to right in order of increasing task complexity. The values are averaged across three random seeds. A BC agent trained with access to the full dataset of ground-truth actions would score 1 on each task.
  • Figure 5: Slot selection study on the basketball task. (a,b) Linear action probes under varying budgets of labeled trajectories and corresponding examples of slot masks for the basketball task with $K=4$. (c) BC performance on different slots for $K=4$ and $K=15$ on the basketball task. (d) Linear action probe scores vs. normalized BC success rates (on 128 trajectories). TL;DR: Probe-based selection correlates with downstream performance. (e) BC performance for varying values of $K$ on the basketball task. Full-action BC achieves a score of 1. TL;DR: Regardless of the value of $K$, LAPO-slots outperforms baseline LAPO.
  • ...and 21 more figures
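
The probe-based slot selection described in the captions for Figures 2 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes slot indices are consistent across samples, fits one ordinary least-squares probe per slot on a small labeled set, and picks the slot with the lowest held-out mean squared error. The names `probe_scores` and `select_slot`, the array shapes, and the synthetic data are all hypothetical.

```python
import numpy as np


def probe_scores(slots: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Held-out MSE of a linear action probe for each slot.

    slots:   (N, K, D) array, K slot vectors of size D per sample (assumed layout).
    actions: (N, A) array of ground-truth actions for the same samples.
    Returns an array of shape (K,); lower means more predictive of actions.
    """
    n, k, _ = slots.shape
    split = n // 2  # first half fits the probe, second half scores it
    scores = np.empty(k)
    for j in range(k):
        # Append a constant column so the probe has a bias term.
        x = np.hstack([slots[:, j, :], np.ones((n, 1))])
        w, *_ = np.linalg.lstsq(x[:split], actions[:split], rcond=None)
        residual = x[split:] @ w - actions[split:]
        scores[j] = np.mean(residual ** 2)
    return scores


def select_slot(slots: np.ndarray, actions: np.ndarray) -> int:
    """Index of the slot whose linear probe predicts actions best."""
    return int(np.argmin(probe_scores(slots, actions)))


# Synthetic check: only slot 2 carries action information.
rng = np.random.default_rng(0)
N, K, D, A = 400, 4, 8, 2
slots = rng.normal(size=(N, K, D))
true_w = rng.normal(size=(D, A))
actions = slots[:, 2, :] @ true_w + 0.01 * rng.normal(size=(N, A))

best = select_slot(slots, actions)
print(best)
```

In this toy setup the probe on slot 2 reaches near-zero error while the other slots cannot predict the actions at all, so `select_slot` returns the informative slot. With real slot features the margin between slots would be smaller, and the probe needs only the small labeled budget that is also used for fine-tuning.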