Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos

Luigi Seminara, Giovanni Maria Farinella, Antonino Furnari

TL;DR

The paper addresses learning explicit, human-interpretable task graphs representing procedural activities from egocentric videos. It introduces a differentiable Task Graph Maximum Likelihood (TGML) loss that enables gradient-based learning of a DAG over key-steps, with two models: DO, which directly optimizes the adjacency weights, and TGT, which predicts graphs from textual or visual embeddings using transformers. Empirical results on CaptainCook4D show substantial gains in task-graph generation accuracy and competence in zero-shot video understanding tasks like pairwise ordering and future prediction; the learned graphs also significantly boost online mistake detection on Assembly101-O and EPIC-Tent-O datasets, outperforming several baselines. The work demonstrates the value of explicit, differentiable graph representations for procedural understanding and downstream decision-support tasks, and provides code for replication.

Abstract

Procedural activities are sequences of key-steps aimed at achieving specific goals. They are crucial to build intelligent agents able to assist users effectively. In this context, task graphs have emerged as a human-understandable representation of procedural activities, encoding a partial ordering over the key-steps. While previous works generally relied on hand-crafted procedures to extract task graphs from videos, in this paper, we propose an approach based on direct maximum likelihood optimization of edges' weights, which allows gradient-based learning of task graphs and can be naturally plugged into neural network architectures. Experiments on the CaptainCook4D dataset demonstrate the ability of our approach to predict accurate task graphs from the observation of action sequences, with an improvement of +16.7% over previous approaches. Owing to the differentiability of the proposed framework, we also introduce a feature-based approach, aiming to predict task graphs from key-step textual or video embeddings, for which we observe emerging video understanding abilities. Task graphs learned with our approach are also shown to significantly enhance online mistake detection in procedural egocentric videos, achieving notable gains of +19.8% and +7.5% on the Assembly101-O and EPIC-Tent-O datasets. Code for replicating experiments is available at https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.

Paper Structure

This paper contains 31 sections, 11 equations, 33 figures, 7 tables.

Figures (33)

  • Figure 1: (a) An example task graph encoding dependencies in a "mix eggs" procedure. (b) We learn a task graph which encodes a partial ordering between actions (left), represented as an adjacency matrix $Z$ (center), from input action sequences (right). The proposed Task Graph Maximum Likelihood (TGML) loss directly supervises the entries of the adjacency matrix $Z$ generating gradients to maximize the probability of edges from past nodes ($K_3, K_1$) to the current node ($K_2$), while minimizing the probability of edges from past nodes to future nodes ($K_4, K_5$) in a contrastive manner.
  • Table 1: Task graph generation results on CaptainCook4D. Best results are in bold, second-best results are underlined, and the best results among competitors are highlighted. Confidence interval bounds are computed at 90% confidence over 5 runs.
  • Figure 2: Given a sequence $\langle S,A,B,D,C,E \rangle$ and a graph $G$ with adjacency matrix $Z$, our goal is to estimate the likelihood $P(\langle S,A,B,D,C,E \rangle \mid Z)$, which can be done by factorizing the expression into simpler terms. The figure shows an example of computation of the probability $P(D \mid S,A,B,Z)$ as the ratio of the "feasibility of sampling key-step D, having observed key-steps S, A, and B" to the sum of all feasibility scores for unobserved symbols. Feasibility values are computed by summing the weights of edges $D \to X$ for all observed key-steps $X$.
  • Figure 3: Our Task Graph Transformer (TGT) takes as input either $D$-dimensional text embeddings extracted from key-step names or video embeddings extracted from key-step segments. In both cases, we extract features with a pre-trained EgoVLPv2 model. For video embeddings, multiple embeddings can refer to the same action, so we randomly select one for each key-step (RS blocks). Learnable start (S) and end (E) embeddings are also included. Key-step embeddings are processed using a transformer encoder and regularized with a distinctiveness cross-entropy to prevent representation collapse. The output embeddings are processed by our relation head, which concatenates vectors across all $(n + 2)^2$ possible node pairs, producing $(n + 2) \times (n + 2) \times 2D$ relation vectors. These vectors are then processed by a relation transformer, which progressively maps them to an $(n + 2) \times (n + 2)$ adjacency matrix. The model is supervised with input sequences using our proposed Task Graph Maximum Likelihood (TGML) loss.
  • Figure 4: To further investigate the effect of noise, we conducted an analysis based on the controlled perturbation of ground truth action sequences, to simulate noise in the action detection process. At inference, we perturbed each key-step with a probability $\alpha$ (the "perturbation rate"), with three kinds of perturbations: insert (inserting a new key-step with a random action class), delete (deleting a key-step), or replace (randomly changing the class of a key-step). The plots show the trend of the F1 score (Average, Correct, and Mistake) as the perturbation rate increases in the case of Assembly101-O (left) and EPIC-Tent-O (right). Results suggest that the proposed approach can still bring benefits even in the presence of imperfect action detections, with the average F1 score dropping by 10 to 15 points at a moderate noise level of $20\%$.
  • ...and 28 more figures
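
The factorization described in the Figure 2 caption can be illustrated with a small sketch. It follows only what the caption states: the feasibility of a key-step is the sum of the weights of its edges towards the observed key-steps, and each conditional probability is that feasibility divided by the sum of feasibilities over all unobserved key-steps. The function names, the `eps` smoothing term, and the toy matrices are assumptions made for illustration; the paper's full definition and the released code may differ.

```python
import math

def feasibility(Z, index, step, observed):
    """Sum of edge weights step -> X over all observed key-steps X."""
    return sum(Z[index[step]][index[x]] for x in observed)

def sequence_log_likelihood(Z, index, sequence, eps=1e-9):
    """log P(sequence | Z), factorized one key-step at a time.

    Z[i][j] is the weight of the edge i -> j. `eps` is an assumed
    smoothing term that keeps the logarithm finite.
    """
    observed = [sequence[0]]        # the start node is taken as given
    remaining = list(sequence[1:])  # key-steps not observed yet
    log_p = 0.0
    while remaining:
        current = remaining[0]
        num = feasibility(Z, index, current, observed)
        den = sum(feasibility(Z, index, k, observed) for k in remaining)
        log_p += math.log((num + eps) / (den + eps))
        observed.append(remaining.pop(0))
    return log_p

# Toy graph (hypothetical): A depends on S, B depends on A.
index = {"S": 0, "A": 1, "B": 2}
Z_chain = [
    [0.0, 0.0, 0.0],  # S has no outgoing edges
    [1.0, 0.0, 0.0],  # A -> S
    [0.0, 1.0, 0.0],  # B -> A
]
ll_consistent = sequence_log_likelihood(Z_chain, index, ["S", "A", "B"])
ll_violating = sequence_log_likelihood(Z_chain, index, ["S", "B", "A"])

# Uninformative graph: every edge has weight 0.5.
Z_uniform = [[0.5] * 3 for _ in range(3)]
ll_uniform = sequence_log_likelihood(Z_uniform, index, ["S", "A", "B"])
```

In this toy setting, the ordering that respects the graph receives a log-likelihood close to zero, while the ordering that places B before its precondition A is heavily penalized. Because the expression is built from sums, ratios, and logarithms of the entries of $Z$, the same computation written with a tensor library would be differentiable with respect to $Z$.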
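
The perturbation protocol described in the Figure 4 caption can be sketched as follows. Each key-step is perturbed with probability `alpha`, and a perturbation is one of insert, delete, or replace. Several details are assumptions not stated in the caption: the three kinds are drawn uniformly, an inserted key-step is placed right after the current one, and a replacement always picks a different class (which requires at least two classes). The function name and the toy class list are hypothetical.

```python
import random

def perturb_sequence(sequence, classes, alpha, rng):
    """Return a noisy copy of `sequence` with perturbation rate `alpha`."""
    noisy = []
    for step in sequence:
        if rng.random() >= alpha:
            noisy.append(step)  # key-step left untouched
            continue
        kind = rng.choice(["insert", "delete", "replace"])  # assumed uniform
        if kind == "insert":
            noisy.append(step)
            noisy.append(rng.choice(classes))  # extra key-step, random class
        elif kind == "replace":
            others = [c for c in classes if c != step]
            noisy.append(rng.choice(others))   # change the class
        # "delete": the key-step is simply dropped
    return noisy

classes = ["A", "B", "C", "D", "E"]
clean = ["A", "B", "D", "C", "E"]

unchanged = perturb_sequence(clean, classes, 0.0, random.Random(0))
fully_noisy = perturb_sequence(clean, classes, 1.0, random.Random(0))
moderate = perturb_sequence(clean, classes, 0.2, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the perturbation reproducible, which makes it easier to compare detection scores across perturbation rates on the same noisy sequences.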