Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos

Luigi Seminara, Giovanni Maria Farinella, Antonino Furnari

TL;DR

This work presents a differentiable framework, TGML, for learning directed acyclic task graphs from egocentric procedural sequences by casting graph learning as maximum-likelihood estimation. It introduces two models, DO, which directly optimizes the adjacency matrix, and TGT, a transformer-based predictor that maps text or video embeddings to graphs, both trained with a differentiable TGML loss that contrasts current and future pre-conditions. The approach achieves state-of-the-art task-graph generation across CaptainCook4D, EgoPER, and EgoProceL, and yields notable improvements on downstream tasks in Ego-Exo4D and online mistake detection benchmarks, with strong emergent video understanding capabilities in the TGT model. The authors also provide new EgoProceL graph annotations and publicly release code, enabling reproducibility and broader benchmarking in procedural video understanding.

Abstract

We introduce a gradient-based approach for learning task graphs from procedural activities, improving over hand-crafted methods. Our method directly optimizes edge weights via maximum likelihood, enabling integration into neural architectures. We validate our approach on CaptainCook4D, EgoPER, and EgoProceL, achieving +14.5%, +10.2%, and +13.6% F1-score improvements. Our feature-based approach for predicting task graphs from textual/video embeddings demonstrates emerging video understanding abilities. We also achieved top performance on the procedure understanding benchmark on Ego-Exo4D and significantly improved online mistake detection (+19.8% on Assembly101-O, +6.4% on EPIC-Tent-O). Code: https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.

Paper Structure

This paper contains 61 sections, 26 equations, 48 figures, 15 tables.

Figures (48)

  • Figure 1: (a) An example task graph encoding dependencies in a "mix eggs" procedure. (b) We learn a task graph that encodes a partial ordering between actions (left), represented as an adjacency matrix $Z$ (center), from input action sequences (right). The proposed Task Graph Maximum Likelihood (TGML) loss directly supervises the entries of the adjacency matrix $Z$, generating gradients that maximize the probability of edges from past nodes ($K_3, K_1$) to the current node ($K_2$) while minimizing the probability of edges from past nodes to future nodes ($K_4, K_5$) in a contrastive manner.
  • Figure 2: Given a sequence $\langle S,A,B,D,C,E \rangle$ and a graph $G$ with adjacency matrix $Z$, our goal is to estimate the likelihood $P(\langle S,A,B,D,C,E \rangle \mid Z)$, which can be done by factorizing the expression into simpler terms. The figure shows how the probability $P(D \mid S,A,B,Z)$ is computed as the ratio of the "feasibility of sampling key-step D, having observed key-steps S, A, and B" to the sum of the feasibility scores of all unobserved key-steps. Feasibility values are computed by summing the weights of edges $D \to X$ for all observed key-steps $X$.
  • Figure 3: Our Task Graph Transformer (TGT) takes as input either $D$-dimensional text embeddings extracted from key-step names or video embeddings extracted from key-step segments. In both cases, we extract features with a pre-trained EgoVLPv2 model. For video embeddings, multiple embeddings can refer to the same action, so we randomly select one for each key-step (RS blocks). Learnable start (S) and end (E) embeddings are also included. Key-step embeddings are processed using a transformer encoder and regularized with a distinctiveness cross-entropy loss (DCEL) to prevent representation collapse. The output embeddings are processed by our relation head, which concatenates vectors across all $(n + 2)^2$ possible node pairs, producing $(n + 2) \times (n + 2) \times 2D$ relation vectors. These vectors are then processed by a relation transformer, which progressively maps them to an $(n + 2) \times (n + 2)$ adjacency matrix. The model is supervised with input sequences using our proposed Task Graph Maximum Likelihood (TGML) loss.
  • Figure 4: An example of transitive dependency between nodes. In (a), node A depends on B and C, and B in turn depends on C. Because the dependency of A on C is already implied by transitivity, we can remove the edge between A and C to obtain the graph in (b).
  • Figure 5: Example of a questionnaire item. Annotators could select multiple options. If annotators determined that a key-step had no pre-conditions, they were instructed to select "None of the above".
  • ...and 43 more figures
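
The computation described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the convention that `Z[i, j]` holds the weight of the edge from key-step `i` to its pre-condition `j`, and the toy binary graph are all assumptions made for the example. The first symbol of the sequence is treated as given, and no guard is included for the case where every unobserved key-step has zero feasibility.

```python
import numpy as np

def feasibility(Z, node, observed):
    """Sum of edge weights node -> X over all observed key-steps X."""
    return sum(Z[node, x] for x in observed)

def step_probability(Z, node, observed):
    """P(node | observed, Z): feasibility of `node` divided by the
    total feasibility of all key-steps not yet observed."""
    unobserved = [k for k in range(Z.shape[0]) if k not in observed]
    total = sum(feasibility(Z, k, observed) for k in unobserved)
    return feasibility(Z, node, observed) / total

def sequence_likelihood(Z, sequence):
    """P(sequence | Z) as a product of per-step conditionals.
    The first symbol (the start node) is taken as given."""
    likelihood = 1.0
    for t in range(1, len(sequence)):
        likelihood *= step_probability(Z, sequence[t], sequence[:t])
    return likelihood

# Hypothetical toy graph over the symbols S, A, B, D, C, E (indices 0..5).
S, A, B, D, C, E = range(6)
Z = np.zeros((6, 6))
Z[A, S] = Z[B, S] = 1.0   # A and B depend on S
Z[D, A] = Z[D, B] = 1.0   # D depends on A and B
Z[C, B] = 1.0             # C depends on B
Z[E, D] = Z[E, C] = 1.0   # E depends on D and C

p_d = step_probability(Z, D, [S, A, B])             # 2 / (2 + 1 + 0)
p_seq = sequence_likelihood(Z, [S, A, B, D, C, E])  # 1/2 * 1/2 * 2/3 * 1/2 * 1
print(p_d, p_seq)
```

With this toy graph, the conditional for D after observing S, A, and B is 2/3, and the full sequence has likelihood 1/12. In a learning setting, `Z` would hold continuous weights and the negative log of this quantity could serve as a training objective.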
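
The pairing step described in the Figure 3 caption, where the relation head concatenates embeddings across all $(n + 2)^2$ node pairs, can be illustrated with NumPy broadcasting. This is a sketch under assumptions: the function name `pairwise_concat` is hypothetical, and the ordering of the two halves (row node first, column node second) is a choice made for the example, not something the caption specifies.

```python
import numpy as np

def pairwise_concat(emb):
    """Build an (m, m, 2*d) tensor whose entry [i, j] is the
    concatenation of emb[i] and emb[j], for an (m, d) input."""
    m, d = emb.shape
    left = np.broadcast_to(emb[:, None, :], (m, m, d))   # emb[i] repeated along j
    right = np.broadcast_to(emb[None, :, :], (m, m, d))  # emb[j] repeated along i
    return np.concatenate([left, right], axis=-1)

# Toy input: n = 3 key-steps plus start and end embeddings, with D = 4.
n, dim = 3, 4
emb = np.arange((n + 2) * dim, dtype=float).reshape(n + 2, dim)
relations = pairwise_concat(emb)
print(relations.shape)
```

Each of the $(n + 2)^2$ relation vectors has length $2D$; a downstream module would then map every such vector to a single edge score to form the $(n + 2) \times (n + 2)$ adjacency matrix.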
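
The edge removal described in the Figure 4 caption corresponds to a transitive reduction of a directed acyclic graph. The following is a minimal sketch assuming a binary adjacency matrix in which `adj[u, v] = 1` means that node `u` depends on node `v`; the helper names are hypothetical and the procedure shown is a generic one, not necessarily the one used by the authors.

```python
import numpy as np

def reachable(adj, src, dst, skip_edge):
    """Depth-first search from src to dst that ignores the direct edge skip_edge."""
    stack, seen = [src], set()
    while stack:
        u = stack.pop()
        for v in np.flatnonzero(adj[u]):
            v = int(v)
            if (u, v) == skip_edge or v in seen:
                continue
            if v == dst:
                return True
            seen.add(v)
            stack.append(v)
    return False

def transitive_reduction(adj):
    """Drop every edge u -> v of a DAG for which a longer path u -> ... -> v exists."""
    reduced = adj.copy()
    n = adj.shape[0]
    for u in range(n):
        for v in range(n):
            if adj[u, v] and reachable(adj, u, v, (u, v)):
                reduced[u, v] = 0
    return reduced

# Graph (a) of Figure 4: A depends on B and C, and B depends on C.
A, B, C = range(3)
adj = np.zeros((3, 3), dtype=int)
adj[A, B] = adj[A, C] = adj[B, C] = 1

reduced = transitive_reduction(adj)
print(reduced)
```

On this input only the edge from A to C is removed, leaving the chain A → B → C of graph (b). The check assumes the input is acyclic.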