ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos

Lei Shi, Paul Bürkner, Andreas Bulling

TL;DR

Adding action embeddings to the noise mask helps the diffusion model learn temporal dependencies between actions and improves performance on procedure planning.

Abstract

We present ActionDiffusion, a novel diffusion model for procedure planning in instructional videos that is the first to take temporal inter-dependencies between actions into account. This approach is in stark contrast to existing methods, which fail to exploit the rich information contained in the particular order in which actions are performed. Our method unifies the learning of temporal dependencies between actions and the denoising of the action plan in the diffusion process by projecting the action information into the noise space. This is achieved 1) by adding action embeddings to the noise masks in the noise-adding phase and 2) by introducing an attention mechanism in the noise prediction network to learn the correlations between different action steps. We report extensive experiments on three instructional video benchmark datasets (CrossTask, Coin, and NIV) and show that our method outperforms previous state-of-the-art methods on all metrics on CrossTask and NIV, and on all metrics except accuracy on the Coin dataset. We show that by adding action embeddings to the noise mask, the diffusion model can better learn temporal dependencies between actions and improve performance on procedure planning.

Paper Structure (28 sections, 12 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Procedure planning in instructional videos using diffusion models. Upper section: The procedure planning task is to generate the intermediate actions given the start and goal observations. Lower left section: Previous work (Projected Diffusion [wang2023pdpp]) does not take the temporal dependencies between actions into account. Lower right section: Our method incorporates these dependencies into the diffusion model.
  • Figure 2: Overview of ActionDiffusion. From an instructional video, we extract the visual features of the start state $o_s$ and the goal state $o_g$ as well as the features of the actions $a_{e_{1:T}}$. We use the task class $c$, the one-hot action classes $a_{1:T}$, $o_s$, and $o_g$ as the input of the diffusion model. Note that we use the ground-truth task class $c$ during training and the predicted task class $\hat{c}$ during inference. A separate task classifier is trained to obtain $\hat{c}$. In the noise-adding phase during training, the noise is added to $a_{1:T}$. For each action, we add all previous action embeddings and the current action embedding in addition to the Gaussian noise. In the denoising phase during inference, we use the U-Net with attention to predict the action-aware noise and denoise $x_n$. The predicted action plan is the action sequence $\hat{a}_{1:T}$ taken from the reconstructed input $x_0$.
  • Figure 3: Architecture of the noise prediction neural network $\epsilon_\theta$. The network $\epsilon_\theta$ is based on U-Net and incorporates attention mechanisms.
  • Figure 4: Examples of action embedding distributions from the CrossTask (a), Coin (b), and NIV datasets (c).
  • Figure 5: Distributions of the diffusion model input after $N$ noise-adding steps for time horizon $T=3$. Each column shows the distribution at $a_i$, $i \in \{1, \dots, T\}$. The distributions in blue use action embeddings together with Gaussian noise in the noise-adding stage. The distributions in orange use Gaussian noise only. The first row shows the distributions from the CrossTask dataset, the second row from the Coin dataset, and the third row from the NIV dataset.
  • ...and 1 more figure
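
The noise-adding step described in the Figure 2 caption (each action receives Gaussian noise plus the embeddings of all previous actions and of the current action) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the action embeddings have already been projected to the same dimension as the action slots of the diffusion input, and it assumes a standard DDPM-style forward step with cumulative noise schedule `alpha_bar_n`. The function names `action_aware_noise` and `add_noise` are hypothetical.

```python
import numpy as np


def action_aware_noise(action_emb, rng):
    """Build an action-aware noise mask (illustrative sketch).

    action_emb: array of shape (T, D), one embedding per action step.
        Assumed to be already projected to the action dimension D.
    Returns an array of shape (T, D) where row t is Gaussian noise plus
    the sum of the embeddings of actions 1..t (previous and current).
    """
    # Row t of the cumulative sum holds emb[0] + ... + emb[t].
    cumulative_emb = np.cumsum(action_emb, axis=0)
    gaussian = rng.standard_normal(action_emb.shape)
    return gaussian + cumulative_emb


def add_noise(x0, noise, alpha_bar_n):
    """DDPM-style forward step (assumed form, not taken from the source):
    x_n = sqrt(alpha_bar_n) * x0 + sqrt(1 - alpha_bar_n) * noise
    """
    return np.sqrt(alpha_bar_n) * x0 + np.sqrt(1.0 - alpha_bar_n) * noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D = 3, 5  # time horizon and action dimension (toy values)
    actions = np.eye(D)[[0, 2, 4]]            # one-hot a_{1:T}
    embeddings = rng.standard_normal((T, D))  # stand-in for a_{e_{1:T}}
    noise = action_aware_noise(embeddings, rng)
    x_n = add_noise(actions, noise, alpha_bar_n=0.5)
    print(x_n.shape)
```

Because the embeddings accumulate along the time axis, the noise at later steps carries information about every earlier action, which is one way to read the caption's "all previous action embeddings and the current action embedding".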