A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives

Simone Alberto Peirone, Francesca Pistilli, Antonio Alliegro, Giuseppe Averta

TL;DR

EgoPack builds a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights, like a backpack of skills that a robot can carry around and use when needed.

Abstract

Human comprehension of a video stream is naturally broad: in a few instants, we are able to understand what is happening, the relevance and relationship of objects, and forecast what will follow in the near future, all at once. We believe that, to effectively transfer such a holistic perception to intelligent machines, an important role is played by learning to correlate concepts and to abstract knowledge coming from different tasks, so that they can be exploited synergistically when learning novel skills. To accomplish this, we seek a unified approach to video understanding which combines shared temporal modelling of human actions with minimal overhead, to support multiple downstream tasks and enable cooperation when learning novel skills. We then propose EgoPack, a solution that creates a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights, like a backpack of skills that a robot can carry around and use when needed. We demonstrate the effectiveness and efficiency of our approach on four Ego4D benchmarks, outperforming current state-of-the-art methods.

Paper Structure (35 sections, 3 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Given a video stream, a robot is asked to learn a novel task, e.g. Object State Change Classification (OSCC). To learn the new skill, the robot can access previously gained knowledge regarding different tasks, such as Point of No Return (PNR), Long Term Anticipation (LTA) and Action Recognition (AR), and use it during the learning process to enhance downstream task performance. This knowledge is stored as graphs inside the robot's backpack, always ready to boost a new skill.
  • Figure 2: Architecture of EgoPack when Action Recognition (AR) is the novel task. Videos are interpreted as a graph, whose nodes $\mathbf{x}_i$ represent actions, encoded as features, and whose edges connect temporally close segments. This representation enables the design of a Unified Temporal Backbone to learn multiple tasks with a shared architecture and minimal Task-Specific Heads, leveraging GNNs for temporal modelling. We exploit this architecture to jointly learn $K$ tasks, e.g. OSCC, LTA and PNR. After this training process, we extract a set of prototypes $\mathbf{P}^k$ that summarise what the network has learnt from each task $\mathcal{T}_k$, like a backpack of skills that we can carry over. In this Cross-Tasks Interaction phase, the network can peek at these different task perspectives to enrich the learning of the novel task.
  • Figure 3: Egocentric vision tasks as graph prediction tasks. In AR and LTA, each node is an action within a temporal sequence and the objective is to predict the verb and noun labels of the nodes. In OSCC and PNR, nodes represent different temporal segments of the video clip and the goal is to output a global prediction for the entire graph (OSCC) or the individual nodes (PNR).
  • Figure 4: Parameter analysis for the cross-tasks interaction module of EgoPack. We analyse the impact on performance of GNN depth and the number of nearest neighbours denoted as $k$-NN.
  • Figure 5: Closest nodes to the OSCC samples among AR and PNR task prototypes. Some nodes appear to be more discriminative of the presence or absence of an object state change.
  • ...and 2 more figures
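
The captions for Figures 2 and 4 describe two ideas that a small example may make more concrete: a video treated as a graph whose edges connect temporally close segments, and a nearest-neighbour lookup from node features into a set of task prototypes. The sketch below is only illustrative and is not the authors' implementation. The function names, the `window` parameter, the use of cosine similarity, and the random features standing in for real segment features and prototypes are all assumptions made for this example.

```python
import numpy as np


def temporal_edges(num_segments: int, window: int = 1) -> np.ndarray:
    """Directed edges (2, E) linking segments at most `window` steps apart.

    `window` is a hypothetical parameter: the source only says that edges
    connect temporally close segments.
    """
    src, dst = [], []
    for i in range(num_segments):
        lo = max(0, i - window)
        hi = min(num_segments, i + window + 1)
        for j in range(lo, hi):
            if i != j:  # no self-loops
                src.append(i)
                dst.append(j)
    return np.array([src, dst], dtype=np.int64)


def nearest_prototypes(x: np.ndarray, prototypes: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices (N, k) of the k prototypes closest to each node feature.

    Cosine similarity is an assumption; the source does not state the metric.
    """
    x_n = x / np.linalg.norm(x, axis=1, keepdims=True)
    p_n = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = x_n @ p_n.T                      # (N, num_prototypes)
    return np.argsort(-sim, axis=1)[:, :k]  # most similar first


# Toy data: 5 video segments and 10 prototypes, both with 8-dim features.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # stand-in for node features x_i
P = rng.normal(size=(10, 8))  # stand-in for one task's prototypes P^k

edges = temporal_edges(num_segments=5, window=1)
neighbours = nearest_prototypes(x, P, k=3)
```

Here `edges` has shape `(2, 8)`, since each of the five segments links to its immediate temporal neighbours, and `neighbours` has shape `(5, 3)`. In a fuller pipeline the retrieved prototypes would presumably be passed through further GNN layers, which this sketch leaves out.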