Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos
Luigi Seminara, Giovanni Maria Farinella, Antonino Furnari
TL;DR
This work presents TGML (Task Graph Maximum Likelihood), a differentiable framework that learns directed acyclic task graphs from egocentric procedural sequences by casting graph learning as maximum-likelihood estimation. It introduces two models: DO, which directly optimizes the adjacency matrix, and TGT, a transformer-based predictor that maps text or video embeddings to graphs. Both are trained with the differentiable TGML loss, which contrasts the pre-conditions of the current key-step with those of key-steps still to come. The approach achieves state-of-the-art task-graph generation on CaptainCook4D, EgoPER, and EgoProceL, and improves downstream results on the Ego-Exo4D procedure understanding benchmark and on online mistake detection; the TGT model also shows emergent video understanding capabilities. The authors additionally provide new task-graph annotations for EgoProceL and publicly release their code, supporting reproducibility and broader benchmarking in procedural video understanding.
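The idea of scoring a key-step sequence against soft edge weights can be illustrated with a short sketch. The exact TGML loss is not spelled out in this summary, so the form below is an assumption: at each step, the weight linking the current key-step to already-observed steps is contrasted with the total weight linking all not-yet-observed key-steps to the observed ones. The function name `sequence_nll`, the convention that `Z[i][j]` means "j is a pre-condition of i", and the toy values are hypothetical, and each key-step is assumed to occur once per sequence.

```python
import math

def sequence_nll(Z, seq, eps=1e-8):
    """Negative log-likelihood of one key-step sequence under edge weights Z.

    Assumed convention: Z[i][j] is the soft weight of the edge saying
    "key-step j is a pre-condition of key-step i". seq[0] is a START node.
    """
    nll = 0.0
    for t in range(1, len(seq)):
        observed = seq[:t]                      # key-steps already seen
        pending = [k for k in range(len(Z)) if k not in observed]
        current = seq[t]                        # key-step performed now
        # Support for the current step from already-observed pre-conditions.
        num = sum(Z[current][j] for j in observed)
        # Support for every pending step (current included) from the same set.
        den = sum(Z[h][j] for h in pending for j in observed)
        nll -= math.log(num + eps) - math.log(den + eps)
    return nll

# Toy graph over nodes 0 (START), 1, 2: step 1 requires 0, step 2 requires 1.
low = 0.01
Z = [[low, low, low],
     [1.0, low, low],
     [low, 1.0, low]]

nll_consistent = sequence_nll(Z, [0, 1, 2])   # order respects the graph
nll_violating = sequence_nll(Z, [0, 2, 1])    # step 2 performed before step 1
```

Under these assumptions, the ordering that respects the graph receives a lower negative log-likelihood than the one that violates it. Because the score is a smooth function of `Z`, the same quantity could in principle be minimized with gradient descent in an automatic-differentiation framework.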
Abstract
We introduce a gradient-based approach for learning task graphs from procedural activities, improving over hand-crafted methods. Our method directly optimizes edge weights via maximum likelihood, enabling integration into neural architectures. We validate our approach on CaptainCook4D, EgoPER, and EgoProceL, achieving F1-score improvements of +14.5%, +10.2%, and +13.6%, respectively. Our feature-based approach for predicting task graphs from textual or video embeddings demonstrates emergent video understanding abilities. We also achieve top performance on the Ego-Exo4D procedure understanding benchmark and significantly improve online mistake detection (+19.8% on Assembly101-O, +6.4% on EPIC-Tent-O). Code: https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.
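For readers unfamiliar with how a predicted task graph can be scored, the following is a minimal sketch of an F1 computation. It assumes the score is taken over directed edges of the predicted and ground-truth graphs; the abstract does not state the exact protocol, and the function name `edge_f1` and the toy edge sets are hypothetical.

```python
def edge_f1(pred_edges, true_edges):
    """F1 between two sets of directed edges, each edge a (node, precondition) pair."""
    pred, true = set(pred_edges), set(true_edges)
    tp = len(pred & true)                 # edges present in both graphs
    if tp == 0:
        return 0.0
    precision = tp / len(pred)            # fraction of predicted edges that are correct
    recall = tp / len(true)               # fraction of true edges that were recovered
    return 2 * precision * recall / (precision + recall)

# Toy example: the prediction recovers both true edges and adds one spurious edge.
true_edges = {(1, 0), (2, 1)}
pred_edges = {(1, 0), (2, 1), (2, 0)}
score = edge_f1(pred_edges, true_edges)
```

Here precision is 2/3 and recall is 1, so the spurious edge lowers the score even though every true edge is recovered.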
