Hierarchical Latent Action Model

Hanjung Kim, Lerrel Pinto, Seon Joo Kim

TL;DR

This work presents HiLAM, a hierarchical latent action model that discovers latent skills by modeling long-term temporal information, and demonstrates that HiLAM improves over the baseline and exhibits robust dynamic skill discovery.

Abstract

Latent Action Models (LAMs) enable learning from actionless data for applications ranging from robotic control to interactive world models. However, existing LAMs typically focus on short-horizon frame transitions and capture low-level motion while overlooking longer-term temporal structure. In contrast, actionless videos often contain temporally extended and high-level skills. We present HiLAM, a hierarchical latent action model that discovers latent skills by modeling long-term temporal information. To capture these dependencies across long horizons, we utilize a pretrained LAM as a low-level extractor. This architecture aggregates latent action sequences, which contain the underlying dynamic patterns of the video, into high-level latent skills. Our experiments demonstrate that HiLAM improves over the baseline and exhibits robust dynamic skill discovery.

Paper Structure (27 sections, 5 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Overview of HiLAM. (a) Overall latent skill learning pipeline. (b) Training objectives used for latent skill learning. (c) Extracting latent actions using a pretrained inverse dynamics model (IDM).
  • Figure 2: Latent skill extraction and policy learning. (a) Latent actions $\mathbf{z}^l$ are hierarchically encoded into stage-wise representations $\mathbf{z}^s$ and then expanded back to a per-timestep latent skill sequence $\mathbf{z}^h$. (b) Overall pipeline of the hierarchical skill policy.
  • Figure 3: LIBERO benchmark results. (a) Performance of BAKU (gray) and HiLAM (blue) on the LIBERO benchmark. (b) LIBERO-Long success rate as a function of the fraction of expert demonstrations used for fine-tuning.
  • Figure 4: Qualitative results for skill boundary prediction. Using the predicted boundary indicators $b^s_t$, we assign each frame to a skill segment $k^s_t$ and display the segment ID for each segment.
  • Figure 5: Qualitative results for future frame prediction using a pretrained forward dynamics model (FDM). Given the current image $I_t$ and the predicted latent action $\hat{z}^l_t$, the model predicts the future frame $\hat{I}_{t+k}$.
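
The captions for Figures 2(a) and 4 describe two bookkeeping steps: turning per-frame boundary indicators $b^s_t$ into segment IDs $k^s_t$, and expanding per-segment representations $\mathbf{z}^s$ back to a per-timestep sequence $\mathbf{z}^h$. The captions do not say how either step is computed, so the following is only a minimal sketch under two assumptions: that $b^s_t = 1$ marks the first frame of a new segment, so $k^s_t$ is a running count of boundaries, and that expansion simply repeats each segment's representation over its frames. The function names `segment_ids` and `expand_to_timesteps` are hypothetical and do not come from the paper.

```python
from itertools import accumulate


def segment_ids(boundaries):
    """Map binary boundary indicators b_t to segment IDs k_t.

    Assumption (not stated in the captions): b_t = 1 marks the first
    frame of a new segment, so k_t is the cumulative sum of b_0..b_t.
    """
    return list(accumulate(int(b) for b in boundaries))


def expand_to_timesteps(segment_values, ids):
    """Repeat each per-segment value over the frames of that segment.

    Assumption: segment_values[i] belongs to the i-th segment in order
    of appearance, so IDs are shifted to start at zero before indexing.
    """
    offset = ids[0]
    return [segment_values[k - offset] for k in ids]


# Seven frames with boundaries at frames 2 and 4 give three segments.
boundaries = [0, 0, 1, 0, 1, 0, 0]
ids = segment_ids(boundaries)

# Strings stand in for the per-segment representations z^s.
per_frame = expand_to_timesteps(["skill_a", "skill_b", "skill_c"], ids)
```

Under these assumptions every frame inside a segment shares one skill representation, which matches the caption's description of a per-timestep sequence derived from stage-wise representations. The model itself may use a learned or soft expansion instead.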