CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning
Lei Shi, Andreas Bulling
TL;DR
CLAD tackles vision-language procedure planning by combining a latent constraint space learned by a Variational Autoencoder (VAE) with a diffusion model that generates intermediate actions conditioned on start and goal visual observations and language. The core idea is to inject the VAE-derived start and goal constraints into the deepest layer of the U-Net used in a denoising diffusion probabilistic model (DDPM), guiding action sequence generation toward semantically plausible plans. The method achieves state-of-the-art performance on CrossTask, COIN, and NIV, with ablations showing that the latent constraints and their integration are critical to the gains, especially on smaller datasets. The approach supports multimodal planning and has practical implications for human-AI collaboration in instructional tasks, where visual observations and natural language descriptions jointly inform the plan.
Abstract
We propose CLAD, a Constrained Latent Action Diffusion model for vision-language procedure planning in instructional videos. Procedure planning is the challenging task of predicting intermediate actions given a visual observation of a start and a goal state. However, future interactive AI systems must also be able to plan procedures using multi-modal input, e.g., where visual observations are augmented with language descriptions. To tackle this vision-language procedure planning task, our method uses a Variational Autoencoder (VAE) to learn latent representations of actions and observations as constraints and integrates them into the diffusion process. This approach exploits the fact that the latent space of diffusion models already carries usable semantics. We use the latent constraints to steer the diffusion model toward generating better action sequences. We report extensive experiments on the popular CrossTask, COIN, and NIV datasets and show that our method outperforms state-of-the-art methods by a large margin. By evaluating ablated versions of our method, we further show that the proposed integration of the action and observation representations learnt in the VAE latent space is key to these performance improvements.
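The pipeline described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the linear noise schedule, and in particular the choice to add the start latent to the first plan step and the goal latent to the last plan step of the bottleneck features are assumptions made for illustration. The sketch shows three pieces: VAE reparameterization to obtain latent constraints, the standard DDPM forward noising step, and a hypothetical constraint injection at the deepest U-Net layer.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Standard VAE reparameterization: z = mu + sigma * eps, eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear DDPM noise schedule (an assumed choice; requires num_steps >= 2)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), used in the closed-form forward process."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, a_bars, rng):
    """DDPM forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a = a_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for x in x0]


def inject_constraints(bottleneck, z_start, z_goal):
    """Hypothetical injection at the deepest U-Net layer.

    `bottleneck` holds one feature vector per plan step. The start latent is
    added to the first step and the goal latent to the last step. This
    additive scheme is an assumption, not the paper's confirmed mechanism.
    """
    out = [list(row) for row in bottleneck]  # copy so the input is not mutated
    out[0] = [h + z for h, z in zip(out[0], z_start)]
    out[-1] = [h + z for h, z in zip(out[-1], z_goal)]
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy encoder outputs for the start and goal observations (made-up numbers).
    z_start = reparameterize([0.2, -0.1], [-2.0, -2.0], rng)
    z_goal = reparameterize([-0.3, 0.4], [-2.0, -2.0], rng)
    a_bars = alpha_bars(linear_beta_schedule(100))
    noisy_actions = q_sample([1.0, 0.0, 0.0, 1.0], 50, a_bars, rng)
    features = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # stand-in bottleneck
    print(inject_constraints(features, z_start, z_goal))
    print(noisy_actions)
```

In a real model the bottleneck features would come from the U-Net encoder applied to the noisy action sequence, and the denoiser would be trained to recover the clean actions; the sketch only fixes the data flow between the components.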
