ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos

Luigi Seminara; Davide Moltisanti; Antonino Furnari

TL;DR

ViterbiPlanNet is a principled framework that explicitly integrates procedural knowledge into the learning process through a Differentiable Viterbi Layer (DVL). The DVL embeds a Procedural Knowledge Graph directly into the Viterbi decoding algorithm and replaces non-differentiable operations with smooth relaxations, enabling end-to-end optimization.

Abstract

Procedural planning aims to predict a sequence of actions that transforms an initial visual state into a desired goal, a fundamental ability for intelligent agents operating in complex environments. Existing approaches typically rely on large-scale models that learn procedural structures implicitly, resulting in limited sample efficiency and high computational cost. In this work, we introduce ViterbiPlanNet, a principled framework that explicitly integrates procedural knowledge into the learning process through a Differentiable Viterbi Layer (DVL). The DVL embeds a Procedural Knowledge Graph (PKG) directly into the Viterbi decoding algorithm, replacing non-differentiable operations with smooth relaxations that enable end-to-end optimization. This design allows the model to learn through graph-based decoding. Experiments on CrossTask, COIN, and NIV demonstrate that ViterbiPlanNet achieves state-of-the-art performance with an order of magnitude fewer parameters than diffusion- and LLM-based planners. Extensive ablations show that performance gains arise from our differentiable structure-aware training rather than post-hoc refinement, resulting in improved sample efficiency and robustness to shorter unseen horizons. We also address testing inconsistencies by establishing a unified testing protocol with consistent splits and evaluation metrics. With this new protocol, we run experiments multiple times and report results using bootstrapping to assess statistical significance.
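
The idea of replacing the non-differentiable operations in Viterbi decoding with smooth relaxations can be illustrated with a small sketch. The abstract does not say which relaxation the DVL uses, so the code below assumes one common choice: the hard max over predecessors is replaced by a temperature-scaled log-sum-exp, and the hard backtrace by a forward-backward pass that returns per-step action weights (a soft plan). The function name `soft_viterbi`, the temperature `tau`, and the toy transition matrix standing in for a PKG are all hypothetical and not taken from the paper.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a list of floats."""
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_viterbi(log_emis, log_trans, tau=1.0):
    """Return a T x K soft plan (each row sums to 1).

    log_emis:  T x K log emission scores (assumed to come from a neural model).
    log_trans: K x K log transition scores (toy stand-in for a PKG).
    tau:       temperature (assumption); smaller values approach hard Viterbi.
    """
    T, K = len(log_emis), len(log_trans)
    # Forward pass: log-sum-exp is the smooth replacement for max.
    alpha = [[log_emis[0][j] / tau for j in range(K)]]
    for t in range(1, T):
        alpha.append([
            logsumexp([alpha[t - 1][i] + log_trans[i][j] / tau for i in range(K)])
            + log_emis[t][j] / tau
            for j in range(K)
        ])
    # Backward pass: smooth replacement for the argmax backtrace.
    beta = [[0.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = logsumexp([
                (log_trans[i][j] + log_emis[t + 1][j]) / tau + beta[t + 1][j]
                for j in range(K)
            ])
    log_z = logsumexp(alpha[-1])
    return [[math.exp(alpha[t][j] + beta[t][j] - log_z) for j in range(K)]
            for t in range(T)]


# Toy example: 3 actions, horizon T = 3. The graph strongly prefers 0 -> 1 -> 2.
trans = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]
emis = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.05, 0.05, 0.9]]
log_trans = [[math.log(p) for p in row] for row in trans]
log_emis = [[math.log(p) for p in row] for row in emis]

soft_plan = soft_viterbi(log_emis, log_trans, tau=1.0)
# Discretize the soft plan by taking the most likely action at each step.
hard_plan = [max(range(3), key=row.__getitem__) for row in soft_plan]
# hard_plan == [0, 1, 2]
```

In this toy case the middle-step emissions alone favor action 0, but decoding through the graph selects action 1. Every operation in `soft_viterbi` is smooth in the emission scores, so an automatic-differentiation framework could in principle pass gradients of a plan loss back to the model that produces them, which is the property the abstract attributes to the DVL.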

Paper Structure (83 sections, 34 equations, 17 figures, 16 tables)

Introduction
Related Work
Procedure Planning in Instructional Videos.
Evaluation of Procedural Planning Approaches.
Explicit Procedural Knowledge in Computer Vision.
Viterbi for Procedure Planning.
Method
Problem Formulation.
ViterbiPlanNet.
Encoding Procedural Knowledge.
Visual Encoding.
Emission Probabilities.
Structured Decoding.
Differentiable Viterbi Layer (DVL).
Training.
...and 68 more sections

Figures (17)

Figure 1: Given start and goal visual states, a neural model computes step-wise emissions. We propose a Differentiable Viterbi Layer that uses a Procedural Knowledge Graph (PKG) to decode emissions into a predicted plan. The layer allows gradients from the planning Loss ($\mathcal{L}$) to flow and train the neural model end-to-end, forcing it to learn structure-aware visual representations.
Figure 2: ViterbiPlanNet consists of four main stages: 1) Encoding Procedural Knowledge -- extracting PKGs from training data, 2) Visual Encoding ($f_{enc}$) -- extracting features from start and goal frames, trained with the $\mathcal{L}_{align}$ and $\mathcal{L}_{task}$ losses, 3) Emission Probabilities -- computing emission probabilities $b$ with $f_{emiss}$, and 4) Structured Decoding ($f_{vit}$) -- parametrized by the PKG, taking emission probabilities as input and outputting a soft plan $\tilde{\pi}$. When training with a plan loss $\mathcal{L}_{plan}$, gradients pass through Structured Decoding and optimize $f_{emiss}$.
Figure 3: Performance as a function of training data on CrossTask for $T=3$. ViterbiPlanNet is more parameter-efficient, as it does not need to memorize procedural knowledge.
Figure 4: Parameter Efficiency on CrossTask and NIV.
Figure 5: Qualitative comparison. The Base Model (top) learns to implicitly memorize the PKG, baking transition probabilities (arrows) directly into its predictions (circles). ViterbiPlanNet (bottom) learns smoother emissions decoupled from the graph, relying on the PKG's structural guidance for decoding.
...and 12 more figures
