VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

Han Lin, Tushar Nagarajan, Nicolas Ballas, Mido Assran, Mojtaba Komeili, Mohit Bansal, Koustuv Sinha

TL;DR

This work shows that a strong off-the-shelf frozen pretrained visual encoder, paired with a well-designed prediction model, can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning without pretraining the prediction model and without additional supervision from language or ASR.

Abstract

Procedural video representation learning is an active research area where the objective is to learn an agent that can anticipate and forecast the future given the present video input, typically in conjunction with textual annotations. Prior works often rely on large-scale pretraining of visual encoders and prediction models with language supervision. However, the necessity and effectiveness of extending compute-intensive pretraining to learn video clip sequences with noisy text supervision have not yet been fully validated by previous works. In this work, we show that a strong off-the-shelf frozen pretrained visual encoder, paired with a well-designed prediction model, can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning without pretraining the prediction model and without additional supervision from language or ASR. Instead of learning representations from pixel space, our method uses the latent embedding space of publicly available vision encoders. By conditioning on frozen clip-level embeddings from observed steps to predict the actions of unseen steps, our prediction model learns robust representations for forecasting through iterative denoising, leveraging recent advances in diffusion transformers (Peebles & Xie, 2023). Empirical studies over a total of five procedural learning tasks across four datasets (NIV, CrossTask, COIN and Ego4D-v2) show that our model improves on strong baselines in long-horizon action anticipation (+2.6% in Verb ED@20, +3.1% in Noun ED@20), and significantly improves the SoTA in step forecasting (+5.0%), task classification (+3.8%), and procedure planning tasks (up to +2.28% in success rate, +3.39% in mAcc, and +0.90% in mIoU).

Paper Structure (34 sections, 7 figures, 7 tables, 2 algorithms)

Figures (7)

  • Figure 1: Overview of our VEDiT training pipeline. Model architecture (left): We introduce masked clip-level latent prediction as our training objective, where we train a Vector Embedding DiT (VEDiT) to iteratively denoise target clip embeddings over $T$ steps, starting from random Gaussian noise, with a flow-matching noise scheduler. Downstream tasks (right): We train VEDiT with a lightweight attentive classifier (Bardes et al., 2024) with cross-entropy loss for the following tasks. (a) Step forecasting / action anticipation task: predict the embeddings of the next unseen clip from observed clips with VEDiT. (b) Procedure planning task: predict the embeddings of intermediate unseen clips from observed starting and goal clips with VEDiT. (c) Procedural activity classification task: given a sequence of observed video clips, predict the label of the procedural video.
  • Figure 2: Vector Embedding Diffusion Transformer (VEDiT) architecture. During training, our model first uses frozen visual encoders to convert observed video clips into corresponding video embeddings. Random Gaussian noise is then sampled as the initial video embeddings of the unseen target clips. The DiT-based prediction model processes seen and target video embeddings in two separate branches, and fuses their information via joint attention blocks where $\text{Q}'=\text{Concat}[\text{Q}^\text{s}, \text{Q}^\text{t}]$, $\text{K}'=\text{Concat}[\text{K}^\text{s}, \text{K}^\text{t}]$, $\text{V}'=\text{Concat}[\text{V}^\text{s}, \text{V}^\text{t}]$. To enable temporal modeling of clips, rotary positional embeddings (RoPE) are applied to $\text{Q}'$ and $\text{K}'$ before they are passed to the attention module. The denoised target clip embeddings are then given as input to the attentive classifier in downstream tasks.
  • Figure 3: SCHEMA w/ VEDiT is more stable than SCHEMA w/ vanilla transformer as we increase the number of transformer blocks.
  • Figure 4: Ablation of attention mechanisms, including our default joint attention, self-attention, and cross-attention. We denote seen and target video clip embeddings as $\bm{v}^\text{s}$ and $\bm{v}^\text{t}$ respectively.
  • Figure 5: Top-1 classification accuracy on the COIN dataset with different attention mechanisms.
  • ...and 2 more figures
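
The joint attention described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `rope` and `joint_attention`, the per-branch weight dictionaries, and the default position indexing (seen clips first, target clips after) are assumptions made for illustration, and the sketch is single-head and omits timestep conditioning, multi-head splitting, and the feed-forward layers of a full DiT block.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x (n, d) by position-dependent angles; d must be even."""
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)        # (half,) rotation frequencies
    angles = positions[:, None] * freqs[None, :]     # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def joint_attention(seen, target, w_seen, w_target, positions=None):
    """Single-head joint attention over seen (n_s, d) and target (n_t, d) embeddings.

    w_seen / w_target are hypothetical dicts with 'q', 'k', 'v' matrices of shape
    (d, d), one set per branch, mirroring the two-branch design in the caption.
    """
    # Each branch projects its own tokens, then the results are concatenated:
    # Q' = Concat[Q^s, Q^t], K' = Concat[K^s, K^t], V' = Concat[V^s, V^t].
    q = np.concatenate([seen @ w_seen["q"], target @ w_target["q"]], axis=0)
    k = np.concatenate([seen @ w_seen["k"], target @ w_target["k"]], axis=0)
    v = np.concatenate([seen @ w_seen["v"], target @ w_target["v"]], axis=0)

    if positions is None:
        # Assumption: target clips follow the seen clips in temporal order.
        positions = np.arange(q.shape[0], dtype=np.float64)

    # RoPE is applied to Q' and K' only, before attention.
    q, k = rope(q, positions), rope(k, positions)

    scores = q @ k.T / np.sqrt(q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)

    out = weights @ v
    n_s = seen.shape[0]
    return out[:n_s], out[n_s:], weights             # split back into the two branches
```

Because every token attends over the concatenated sequence, target embeddings can read from seen embeddings and vice versa within a single attention call. For tasks such as procedure planning, where the goal clip comes after the unseen clips, the `positions` argument would need to be supplied explicitly rather than left at its default.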
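
The iterative denoising mentioned in the Figure 1 caption can be sketched as Euler integration of a learned velocity field, which is one common way to sample from a flow-matching model. The sketch below is an illustration under stated assumptions, not the paper's sampler: it assumes the linear path $x_t = (1-t)\,x_\text{clean} + t\,\epsilon$ with $t=1$ as pure noise, uniform time steps, and a hypothetical `velocity_fn(x, seen, t)` callable standing in for the trained prediction model. The oracle velocity is only there to make the example runnable without a trained network.

```python
import numpy as np

def euler_flow_sampler(velocity_fn, seen, n_target, dim, num_steps, rng):
    """Denoise target embeddings from Gaussian noise over num_steps Euler steps.

    velocity_fn(x, seen, t) is a hypothetical stand-in for the prediction model;
    it should return an estimate of dx/dt with the same shape as x.
    """
    x = rng.standard_normal((n_target, dim))          # t = 1: pure Gaussian noise
    times = np.linspace(1.0, 0.0, num_steps + 1)      # assumed uniform schedule
    for t, t_next in zip(times[:-1], times[1:]):
        v = velocity_fn(x, seen, t)                   # conditioned on seen clips
        x = x + (t_next - t) * v                      # Euler step toward t = 0
    return x

def make_oracle_velocity(clean_target):
    """Exact velocity for the assumed linear path toward a known clean target."""
    def velocity_fn(x, seen, t):
        # On the path x_t = (1 - t) * clean + t * noise, dx/dt = noise - clean,
        # which equals (x_t - clean) / t for t > 0. `seen` is unused by the oracle.
        return (x - clean_target) / t
    return velocity_fn

rng = np.random.default_rng(0)
clean = rng.standard_normal((3, 8))                   # stand-in for true target embeddings
seen = rng.standard_normal((4, 8))                    # stand-in for frozen seen-clip embeddings
denoised = euler_flow_sampler(make_oracle_velocity(clean), seen, 3, 8, 10, rng)
```

With the oracle velocity, every Euler step stays on the straight line between the noise and the clean target, so `denoised` matches `clean` up to floating-point error. A trained model only approximates this field, which is why the number of steps and the schedule matter in practice.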