RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos

Ali Zare, Yulei Niu, Hammad Ayyubi, Shih-fu Chang

TL;DR

Adaptive procedure planning for instructional videos is challenged by unknown plan lengths, temporal relations among steps, and annotation costs. The authors propose the Retrieval-Augmented Planner (RAP), which combines an auto-regressive base planner with a retrieval memory to refine action sequences, and employs weak supervision from unannotated videos via a grounding-based pseudo-annotation pipeline. They also introduce mean edit-score (mES) to evaluate variable-length plans and present a two-stage training regime that first trains the base planner and then jointly trains the retrieval-enhanced model. On CrossTask and COIN, RAP outperforms fixed-horizon baselines and existing retrieval-augmented methods, demonstrating strong generalization and practical potential for scalable adaptive procedure planning.
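The summary above does not spell out how the mean edit-score is computed. As a rough illustration only, the sketch below assumes the common definition of an edit score as a Levenshtein distance between predicted and ground-truth action sequences, normalized by the longer sequence and scaled to 0-100, then averaged over samples. The function names (`levenshtein`, `edit_score`, `mean_edit_score`) and the normalization are assumptions, not the paper's exact formulation.

```python
def levenshtein(a, b):
    """Edit distance between two action sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def edit_score(pred, gt):
    """Assumed normalization: 100 * (1 - distance / longer length)."""
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(pred, gt) / longest)


def mean_edit_score(preds, gts):
    """Average the per-sample edit score over paired plans."""
    scores = [edit_score(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)
```

Under this assumed definition, a prediction that omits one step of a three-step plan still earns partial credit, whereas an exact-match success rate would count it as a complete miss.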

Abstract

Procedure planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite rapid progress on this task, several critical challenges remain: (1) Adaptive procedures: Prior works make the unrealistic assumption that the number of action steps is known and fixed, which yields models that do not generalize to real-world scenarios where the sequence length varies. (2) Temporal relation: Understanding the temporal relations between steps is essential for producing reasonable and executable plans. (3) Annotation cost: Annotating instructional videos with step-level labels (i.e., timestamp) or sequence-level labels (i.e., action category) is demanding and labor-intensive, which limits generalizability to large-scale datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is not fixed or pre-determined. To address these challenges, we introduce the Retrieval-Augmented Planner (RAP) model. Specifically, for adaptive procedures, RAP adaptively determines when the action sequence ends using an auto-regressive model architecture. For temporal relation, RAP establishes an external memory module to explicitly retrieve the most relevant state-action pairs from the training videos and revise the generated procedures. To tackle the high annotation cost, RAP uses weakly supervised learning to expand the training dataset to other task-relevant, unannotated videos by generating pseudo labels for action steps. Experiments on the CrossTask and COIN benchmarks show the superiority of RAP over traditional fixed-length models, establishing it as a strong baseline solution for adaptive procedure planning.

Paper Structure

This paper contains 26 sections, 10 equations, 8 figures, and 13 tables.

Figures (8)

  • Figure 1: Examples of adaptive procedures in instructional videos. Procedure length varies with differences between initial and goal visual states, even for the same task. Previous works assume a fixed, known sequence length, requiring a separate model for each action-horizon. This figure contrasts traditional procedure planning (A) with our adaptive approach (B), which trains a unified model without knowing the procedure length in advance.
  • Figure 2: Framework overview: Visual observations $o_s$ and $o_g$ are encoded to $v_s$ and $v_g$. A Task-Classifier predicts a task class and its representation $\hat{c}$ using $v_s$ and $v_g$. These, along with a learnable $\texttt{START}$ token, are fed into the auto-regressive base planner $F_{base}$ for sequential action embedding prediction. For each prediction at position $t$, the retrieval component uses the predicted context vector $\varsigma_t$ to estimate a probability distribution $P_{\text{retr,t}}$ over the closest $K$ key-value pairs in memory for kNN linear interpolation with $P_{\text{base,t}}$ to estimate the final probability distribution $P_{\text{all,t}}$.
  • Figure 3: The Retrieval-Augmented Planner module uses the predicted context vector $\varsigma_t$ to retrieve the $K$ closest keys and their values from memory, estimating a probability distribution $P_{\text{retr,t}}$ over these entries for next action probability estimation $P_{\text{all,t}}$.
  • Figure 4: Illustration of the advantage of the edit score when evaluating long, variable-length sequences. Although the prediction is plausible and acceptable, accuracy and success rate falsely flag this scenario as a miss.
  • Figure 5: Sample plausible predicted plan sequences generated by RAP vs the ground-truth (GT).
  • ...and 3 more figures
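
The kNN linear interpolation described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes squared Euclidean distance between the context vector $\varsigma_t$ and the memory keys, a softmax over negative distances to form $P_{\text{retr,t}}$, and a fixed mixing weight `lam`. The names `knn_interpolate`, `lam`, and `temperature` are hypothetical.

```python
import math


def knn_interpolate(p_base, context, keys, values, k=2, lam=0.5, temperature=1.0):
    """Blend a base action distribution with a kNN retrieval distribution.

    p_base  : probabilities over action classes from the base planner
    context : query (context) vector for the current position t
    keys    : memory key vectors; values: action index stored with each key
    lam     : assumed interpolation weight on the retrieval distribution
    """
    # Squared Euclidean distance from the query to every memory key (assumed metric).
    dists = [sum((a - b) ** 2 for a, b in zip(context, key)) for key in keys]
    # Indices of the k nearest memory entries.
    nearest = sorted(range(len(keys)), key=lambda i: dists[i])[:k]
    # Softmax over negative distances, shifted by the minimum for stability.
    d_min = min(dists[i] for i in nearest)
    weights = [math.exp(-(dists[i] - d_min) / temperature) for i in nearest]
    total = sum(weights)
    # Aggregate retrieved probability mass per action class.
    p_retr = [0.0] * len(p_base)
    for i, w in zip(nearest, weights):
        p_retr[values[i]] += w / total
    # Linear interpolation: P_all = lam * P_retr + (1 - lam) * P_base.
    return [lam * r + (1 - lam) * b for r, b in zip(p_retr, p_base)]


# Toy memory: the two nearest keys both store action 1.
p_all = knn_interpolate(
    p_base=[0.5, 0.3, 0.2],
    context=[0.0, 0.0],
    keys=[[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]],
    values=[1, 1, 2],
)
```

In this toy case the base planner prefers action 0, but the retrieved neighbours shift the final distribution toward action 1, which is the kind of revision the retrieval component is meant to provide.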