CI w/o TN: Context Injection without Task Name for Procedure Planning

Xinjie Li

TL;DR

This work tackles procedure planning in instructional videos under very weak supervision, removing both intermediate supervision and task names. It uses zero-shot BLIP-generated captions of the start and goal observations as context: a context feature is trained with a contrastive loss and injected into a memory-augmented transformer decoder that produces action sequences. Training combines a contrastive term $\mathcal{L}_c$, a cross-entropy term $\mathcal{L}_{ca}$ for action prediction, and an adversarial term $\mathcal{L}_{adv}$ that encourages realistic plans. At inference, the model samples multiple plans from a latent variable $z$ and refines them with Viterbi post-processing using a learned transition matrix $A$ and emission matrix $B$. Experiments on CrossTask and COIN show competitive performance across multiple metrics, supporting the hypothesis that intermediate supervision mainly serves as contextual information and that caption-based context can compensate for weaker supervision.
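
The Viterbi post-processing step mentioned above can be illustrated with a minimal sketch. It assumes that the emission scores are per-step action probabilities and that the transition matrix holds action-to-action probabilities; the function name `viterbi`, the uniform prior, and the toy numbers are illustrative assumptions, not the paper's implementation.

```python
import math


def _log(p):
    # Log-probability that maps zero probability to -inf instead of raising.
    return math.log(p) if p > 0 else float("-inf")


def viterbi(emission, transition, prior=None):
    """Most likely action sequence under a first-order transition model.

    emission[t][a]   : probability of action a at step t (assumed model output)
    transition[i][j] : probability of action j following action i (assumed)
    prior[a]         : probability of starting with action a (uniform if None)
    """
    num_steps, num_actions = len(emission), len(emission[0])
    if prior is None:
        prior = [1.0 / num_actions] * num_actions

    # Best log-score of any path ending in each action at step 0.
    score = [_log(prior[a]) + _log(emission[0][a]) for a in range(num_actions)]
    backpointers = []

    for t in range(1, num_steps):
        new_score, pointers = [], []
        for j in range(num_actions):
            # Best predecessor for action j at step t.
            best_i = max(range(num_actions),
                         key=lambda i: score[i] + _log(transition[i][j]))
            new_score.append(score[best_i]
                             + _log(transition[best_i][j])
                             + _log(emission[t][j]))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final action.
    path = [max(range(num_actions), key=lambda a: score[a])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example: transitions only allow the cycle 0 -> 1 -> 2 -> 0.
A = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
B = [[0.90, 0.05, 0.05],
     [0.50, 0.40, 0.10],
     [0.10, 0.20, 0.70]]
plan = viterbi(B, A)
```

In this toy case a per-step argmax would repeat action 0 at the second step, which the transition matrix forbids; the decoded plan instead follows the allowed order.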

Abstract

This paper explores the challenge of procedure planning in instructional videos, which involves creating goal-directed plans from visual start and goal observations. Previous research has tackled this problem with progressively weaker training supervision, from heavy intermediate visual observations or language instructions to task class supervision. However, with the advent of large language models, a detailed plan can be produced even when only the task name is given. In this study, we propose a much weaker setting without the task name as supervision, which existing large language models cannot currently solve since they require good prompts with sufficient information. Specifically, we hypothesize that previous intermediate supervision can serve as context information, and we use captions of the visual start and goal observations as a much cheaper form of supervision. This approach greatly reduces the labeling cost since the captions can be easily obtained from large pre-trained vision-language foundation models. Technically, we apply BLIP to generate captions as supervision and train the context feature with a contrastive learning loss. The context feature is then fed into the generator to aid plan generation. Our experiments on two datasets of different scales demonstrate that our model achieves comparable performance on multiple metrics, which supports our hypothesis.
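
As a rough illustration of training a context feature against caption features with a contrastive loss, the sketch below uses a one-directional InfoNCE objective over a batch of paired vectors. The abstract does not give the exact loss, so the InfoNCE form, the cosine similarity, the temperature value, and the name `info_nce` are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length (left unchanged if it is all zeros).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(context_feats, caption_feats, temperature=0.07):
    """Mean InfoNCE loss; row i of each input is assumed to be a matched pair.

    context_feats : list of context feature vectors (hypothetical model output)
    caption_feats : list of caption embeddings (e.g. from BLIP-generated text)
    """
    contexts = [_normalize(v) for v in context_feats]
    captions = [_normalize(v) for v in caption_feats]

    total = 0.0
    for i, ctx in enumerate(contexts):
        # Cosine similarity of this context to every caption in the batch.
        logits = [sum(a * b for a, b in zip(ctx, cap)) / temperature
                  for cap in captions]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the matching caption.
        total += log_denominator - logits[i]
    return total / len(contexts)


# Matched pairs give a lower loss than the same captions in swapped order.
contexts = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(contexts, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(contexts, [[0.0, 1.0], [1.0, 0.0]])
```

The loss pulls each context feature toward its own caption embedding and away from the other captions in the batch, which is the general mechanism by which captions can act as supervision for the context feature.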

Paper Structure

This paper contains 24 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Illustration of Different Types of Supervision in Procedural Learning: Language Instructions, Intermediate Visual Observations, and Task Class. In our project, we have removed all the intermediate supervision and instead rely on captions of visual start and goal as context information. This figure is modified from the work of wang2023pdpp.
  • Figure 2: Motivations for our approach. On the left, we observe that ChatGPT can generate a detailed plan when provided with a task name. However, ChatGPT always requires a proper prompt containing sufficient information to retrieve the relevant knowledge. In our project, the model is not aware of the task name, which makes the task more challenging and currently unsolved by large language models. On the right, we hypothesize that all the intermediate supervision, such as language instructions, visual observations, and task class, serves as context information for plan generation. Based on this assumption, we propose a cost-effective approach to obtain context information by generating captions for the visual start and goal observations using a pre-trained BLIP model li2022blip.
  • Figure 3: Overview of our model. Initially, visual observations of the start and goal states (represented by black nodes) are incorporated into the sequence of learned queries (represented by colored nodes). Stochastic noise is then injected into the resulting input sequence. Next, this input is fed into the transformer decoder, which interacts with the global memory to create executable procedural plans. Along with the memory, the context information derived from the visual start and goal observations is injected into the transformer decoder. Subsequently, action vectors are generated, and multiple loss functions are employed to train the model. Specifically, the context information is trained with a contrastive learning loss and supervised by captions generated via BLIP li2022blip. Note that this figure is modified from zhao2022p3iv and our modifications are highlighted with red boxes.
  • Figure 4: Context augmented transformer decoder. In this architecture, the learned memory bank provides the multi-head cross-attention with K and V. In addition, the learned context feature is concatenated with the input as Q. Note that this figure is modified from zhao2022p3iv and our modification is highlighted with the red box.
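
The attention pattern described in the Figure 4 caption can be sketched as follows. This is a single-head simplification with no learned projections, and it assumes that the context feature is concatenated to each query along the feature axis, so the memory keys must have the concatenated dimension. The caption does not specify the concatenation axis, and the function name `context_cross_attention` is hypothetical.

```python
import math


def _softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_cross_attention(queries, context, mem_keys, mem_values):
    """Single-head cross-attention with a context-augmented query.

    queries    : list of query vectors, one per plan step (each of dim d)
    context    : one context feature vector (dim d_c), shared by all queries
    mem_keys   : memory-bank keys, each of dim d + d_c (assumption)
    mem_values : memory-bank values, one per key
    """
    value_dim = len(mem_values[0])
    outputs = []
    for query in queries:
        # Q = [query ; context], concatenated along the feature axis.
        q_ctx = query + context
        scale = math.sqrt(len(q_ctx))
        weights = _softmax([
            sum(a * b for a, b in zip(q_ctx, key)) / scale for key in mem_keys
        ])
        # Attention output: weighted sum of the memory values.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, mem_values))
            for d in range(value_dim)
        ])
    return outputs


# Two queries of dim 2, a context of dim 1, and a three-slot memory bank.
attended = context_cross_attention(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    context=[0.5],
    mem_keys=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    mem_values=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
```

A full decoder layer would add learned projections for Q, K, and V, multiple heads, and residual connections; those are omitted here to keep the role of the memory bank and the context feature visible.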