LAOF: Robust Latent Action Learning with Optical Flow Constraints

Xizhou Bu, Jiexi Lyu, Fulei Sun, Ruichen Yang, Zhiqiang Ma, Wei Li

TL;DR

The paper tackles robust latent action learning from large amounts of action-free video by using optical flow as a pseudo-supervision signal. It introduces LAOF, which adds a dedicated flow decoder that maps latent actions to optical flow and trains it jointly with the inverse and forward dynamics models, so that the learned latent actions are constrained by physical motion. A variant, LAOF-Action, additionally incorporates sparse action labels. Across LIBERO and PROCGEN, the optical-flow constraints stabilize training and improve latent-action quality, giving strong downstream performance with very few or no action labels. Ablation studies show that a dedicated flow decoder yields the best results, and that the approach remains beneficial as the fraction of labeled data rises to about 10%, offering a practical path toward scalable embodied foundation models.
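The joint training described above can be made concrete with a small sketch. This is not the authors' implementation: the function names (`mse`, `joint_loss`), the `flow_weight` parameter, the use of mean squared error for both terms, and the toy stand-in models are all assumptions chosen for illustration. It only shows how a next-state reconstruction term and an optical-flow term might be combined when both are driven by the same latent action.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    s_t: Vector,
    s_next: Vector,
    f_t: Vector,
    idm: Callable[[Vector, Vector], Vector],
    fdm: Callable[[Vector, Vector], Vector],
    flow_decoder: Callable[[Vector], Vector],
    flow_weight: float = 1.0,  # hypothetical weighting, not taken from the paper
) -> Tuple[float, float, float]:
    """Return (total, reconstruction, flow) losses for one transition."""
    z = idm(s_t, s_next)            # inverse dynamics: infer the latent action
    s_next_pred = fdm(s_t, z)       # forward dynamics: predict the next state
    f_pred = flow_decoder(z)        # flow decoder: latent action -> flow features
    recon = mse(s_next_pred, s_next)
    flow = mse(f_pred, f_t)
    return recon + flow_weight * flow, recon, flow


# Toy stand-ins for the three learned models (illustrative only).
toy_idm = lambda s, s2: [b - a for a, b in zip(s, s2)]
toy_fdm = lambda s, z: [a + d for a, d in zip(s, z)]
toy_flow_decoder = lambda z: [2.0 * d for d in z]

total, recon, flow = joint_loss(
    s_t=[0.0, 1.0],
    s_next=[1.0, 3.0],
    f_t=[2.0, 5.0],
    idm=toy_idm,
    fdm=toy_fdm,
    flow_decoder=toy_flow_decoder,
    flow_weight=0.5,
)
print(total, recon, flow)
```

In this sketch the flow term is the only one that penalizes the latent action directly against observed motion. The reconstruction term alone can be satisfied by a latent that also encodes action-irrelevant changes between frames.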

Abstract

Learning latent actions from large-scale videos is crucial for the pre-training of scalable embodied foundation models, yet existing methods often struggle with action-irrelevant distractors. Although incorporating action supervision can alleviate these distractions, its effectiveness is restricted by the scarcity of available action labels. Optical flow represents pixel-level motion between consecutive frames, naturally suppressing background elements and emphasizing moving objects. Motivated by this, we propose robust Latent Action learning with Optical Flow constraints, called LAOF, a pseudo-supervised framework that leverages the agent's optical flow as an action-driven signal to learn latent action representations robust to distractors. Experimental results show that the latent representations learned by LAOF outperform existing methods on downstream imitation learning and reinforcement learning tasks. This superior performance arises from optical flow constraints, which substantially stabilize training and improve the quality of latent representations under extremely label-scarce conditions, while remaining effective as the proportion of action labels increases to 10 percent. Importantly, even without action supervision, LAOF matches or surpasses action-supervised methods trained with 1 percent of action labels.

Paper Structure

This paper contains 21 sections, 3 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Overview of LAOF framework: Consecutive observations $(o_t, o_{t+1})$ and their corresponding RGB-formatted optical flow $f_{\text{rgb},t}$ are encoded into feature space $(s_t, s_{t+1}, f_t)$. The inverse and forward dynamics models, along with the flow decoder, are then jointly optimized under the combined supervision of next-state reconstruction and optical flow constraints.
  • Figure 1: Effect of continuous latent actions on downstream imitation learning performance on LIBERO. MSE denotes the mean squared error between the predicted and ground-truth actions, Succ. denotes the average task success rate over 1000 trials, w/ OF denotes that the method uses optical flow constraints, and Avg. Impr. indicates the average improvement over LAPO. LAOM-Action and LAOF-Action are action-supervised methods, evaluated under a 1$\%$ action ratio.
  • Figure 2: Visualization of optical flow on LIBERO and PROCGEN. Inter-frame optical flow, estimated using RAFT [teed2020raft], is shown below each image, representing the motion from the current frame to the next. Colors indicate motion direction, with purple corresponding to upward movement. In tasks where all distractors are static, the agent’s optical flow can be directly extracted, as illustrated by the robotic arm motions in LIBERO. For scenarios involving dynamic distractors, LangSAM [langSAM] is employed to isolate object-centric optical flow.
  • Figure 2: Effect of discrete latent actions on downstream reinforcement learning performance on PROCGEN. Acc. denotes the accuracy of action classification ($\%$), Return denotes the average normalized episodic return over 1000 trials.
  • Figure 3: Subfigures (a) and (b) compare downstream task performance between continuous (solid lines) and discrete (dashed lines) latent action representations, using normalized episodic return for PROCGEN and success rate for LIBERO. The larger area under the solid lines compared to the dashed lines indicates that continuous representations outperform discrete ones across all downstream tasks (Q1). Subfigures (c) and (d) illustrate that our latent action evaluation metric is highly correlated with downstream task performance, with the mean Pearson correlation coefficient averaged across all tasks being 0.8288 for PROCGEN and -0.7311 for LIBERO.
  • ...and 5 more figures
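
The Pearson correlation coefficients quoted for Figure 3 can be read more easily with a small worked sketch. The helper `pearson` and every number in the toy lists below are invented for illustration and are not results from the paper. The sketch assumes, based on the other captions, that the evaluation metric is an accuracy on PROCGEN (higher is better) and a mean squared error on LIBERO (lower is better). Under that assumption a strong positive coefficient on PROCGEN and a strong negative one on LIBERO both mean that a better metric tends to go with better downstream performance.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers, one entry per hypothetical method (not from the paper).
accuracy = [55.0, 62.0, 70.0, 81.0]       # higher is better
episodic_return = [0.20, 0.35, 0.33, 0.60]

action_mse = [0.40, 0.31, 0.25, 0.12]     # lower is better
success_rate = [0.30, 0.42, 0.40, 0.71]

r_accuracy = pearson(accuracy, episodic_return)   # positive for this toy data
r_mse = pearson(action_mse, success_rate)         # negative for this toy data
print(r_accuracy, r_mse)
```

On Python 3.10 or later, `statistics.correlation` from the standard library computes the same quantity.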