CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning

Lei Shi, Andreas Bulling

TL;DR

CLAD tackles vision-language procedure planning by combining a latent constraint space learned by a Variational Autoencoder (VAE) with a diffusion model that generates intermediate actions conditioned on start and goal visual observations and language. The core idea is to inject the VAE-derived start and goal constraints into the deepest layer of the U-Net used in a DDPM, steering action sequence generation toward semantically plausible plans. The method achieves state-of-the-art performance on CrossTask, COIN, and NIV, and ablations show that the latent constraints and their integration are critical to the gains, especially on smaller datasets. The approach has practical implications for human–AI collaboration on instructional tasks, where visual observations and natural language descriptions jointly inform the plan.

Abstract

We propose CLAD, a Constrained Latent Action Diffusion model for vision-language procedure planning in instructional videos. Procedure planning is the challenging task of predicting intermediate actions given visual observations of a start and a goal state. However, future interactive AI systems must also be able to plan procedures from multi-modal input, e.g., where visual observations are augmented with language descriptions. To tackle this vision-language procedure planning task, our method uses a Variational Autoencoder (VAE) to learn latent representations of actions and observations as constraints and integrates them into the diffusion process. This approach exploits the fact that the latent space of diffusion models already carries usable semantics. We use the latent constraints to steer the diffusion model toward generating better actions. We report extensive experiments on the popular CrossTask, COIN, and NIV datasets and show that our method outperforms state-of-the-art methods by a large margin. By evaluating ablated versions of our method, we further show that the proposed integration of the action and observation representations learnt in the VAE latent space is key to these performance improvements.
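The abstract says that a VAE learns the latent constraints but gives no details of the encoder or loss. As a rough illustration only, the sketch below shows the two standard ingredients of a Gaussian VAE: the reparameterization trick for sampling a latent vector, and the closed-form KL divergence to a standard normal prior. The function names and the diagonal-Gaussian posterior are assumptions for illustration, not the paper's implementation.

```python
import math
import random
from typing import List, Sequence


def reparameterize(mu: Sequence[float], log_var: Sequence[float],
                   rng: random.Random) -> List[float]:
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    ASSUMPTION: a diagonal Gaussian posterior parameterized by its mean
    and log-variance, as in a standard VAE.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: Sequence[float],
                          log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.3, -0.1], [0.0, -2.0]  # hypothetical encoder outputs
    print("latent sample:", reparameterize(mu, log_var, rng))
    print("KL term:", kl_to_standard_normal(mu, log_var))
```

In a setup like the one described, the sampled latent vectors for the start and goal would act as the constraints handed to the diffusion model; how CLAD trains and weights these terms is not specified in this summary.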

Paper Structure

This paper contains 38 sections, 9 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: CLAD is a novel method for predicting intermediate actions in procedure planning tasks using vision-language input. The start and goal actions are first encoded into a VAE latent space as constraints and then integrated into the latent space of a diffusion model to steer the action generation.
  • Figure 2: Overview of our proposed method CLAD. Processing: We first process the input data by extracting features from the natural language input and the visual input using a pretrained text encoder and a pretrained image encoder. Latent Constraint Learning: The text features and image features are concatenated as the input to the VAE. Task Prediction: The visual features of the start and goal are used to predict the task class. Procedure Planning: The diffusion model takes the start and goal visual features and the predicted task class as input to generate the action sequence. The latent constraints learnt by the VAE are integrated into the deepest layer of the U-Net.
  • Figure 3: Qualitative results on the CrossTask dataset. The time horizon is 4 and the data curation setting is PDPP. Red boxes mark the ground-truth actions and blue boxes mark the predicted actions.
  • Figure 4: More qualitative results on the CrossTask dataset, showing cases where our model fails to predict the correct action sequence.
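The Figure 2 caption states that the latent constraints are integrated into the deepest layer of the U-Net. The toy sketch below illustrates only that structural idea: a signal is downsampled to a bottleneck, a constraint vector is combined with the bottleneck, and the result is upsampled with skip connections. Everything here is hypothetical: the weight-free pooling stages, the `toy_unet` name, and especially the element-wise addition used for injection, since the summary does not say how CLAD combines the constraints with the bottleneck features.

```python
from typing import List

Vector = List[float]


def downsample(x: Vector) -> Vector:
    """Halve the length by averaging neighbouring pairs (toy encoder stage)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def upsample(x: Vector) -> Vector:
    """Double the length by repeating each element (toy decoder stage)."""
    return [v for v in x for _ in range(2)]


def toy_unet(x: Vector, constraint: Vector, depth: int = 2) -> Vector:
    """Run a weight-free U-shaped pass and inject `constraint` at the bottleneck."""
    if len(x) % (2 ** depth) != 0:
        raise ValueError("len(x) must be divisible by 2**depth")
    skips: List[Vector] = []
    h = list(x)
    for _ in range(depth):          # encoder path: store skip, then downsample
        skips.append(h)
        h = downsample(h)
    if len(constraint) != len(h):
        raise ValueError("constraint must match the bottleneck length")
    # Deepest layer. ASSUMPTION: injection by element-wise addition.
    h = [a + b for a, b in zip(h, constraint)]
    for skip in reversed(skips):    # decoder path: upsample, then add skip
        h = upsample(h)
        h = [a + b for a, b in zip(h, skip)]
    return h


if __name__ == "__main__":
    noisy_actions = [1.0] * 8           # stand-in for a noisy action sequence
    start_goal_latent = [0.5, -0.5]     # stand-in for VAE-derived constraints
    print(toy_unet(noisy_actions, start_goal_latent))
    print(toy_unet(noisy_actions, [0.0, 0.0]))
```

Comparing the two printed outputs shows that a change made only at the bottleneck propagates to every position of the output, which is one intuition for why the deepest layer is a natural place to steer generation.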