Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition

Julia Lee Romero, Kyle Min, Subarna Tripathi, Morteza Karimzadeh

TL;DR

This work tackles fine-grained keystep recognition in egocentric videos, a problem challenged by dynamic backgrounds and occlusions. It introduces MAGLEV, a graph-based framework that represents each keystep segment as a node and enables training-time integration of exocentric views to boost egocentric inference, including multimodal extensions with depth and narrations. MAGLEV demonstrates state-of-the-art performance on Ego-Exo4D, outperforming prior ego-only and ego-exo methods by substantial margins, while maintaining compute efficiency through sparse graphs and pre-extracted features. The approach offers a practical pathway to robust procedural understanding in egocentric settings and opens avenues for multimodal graph learning in long-form video understanding.

Abstract

Egocentric videos capture scenes from a wearer's viewpoint, resulting in dynamic backgrounds, frequent motion, and occlusions, all of which make accurate keystep recognition challenging. We propose a flexible graph-learning framework for fine-grained keystep recognition that effectively leverages long-term dependencies in egocentric videos and exploits alignment between egocentric and exocentric videos during training to improve inference on egocentric videos. Our approach constructs a graph in which each clip of the egocentric video corresponds to a node. During training, we add each clip of each exocentric video (if available) as an additional node. We examine several strategies for defining connections across these nodes and pose keystep recognition as a node classification task on the constructed graphs. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed framework outperforms existing methods by more than 12 points in accuracy. Furthermore, the constructed graphs are sparse and compute-efficient. We also present a study of how several multimodal features, including narrations, depth, and object class labels, can be harnessed on a heterogeneous graph, and discuss their respective contributions to keystep recognition performance.
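
To make the graph construction concrete, the following is a minimal sketch of one possible connection strategy, not the authors' implementation. It assumes that clips are temporally aligned across views, that each clip is linked to its temporal neighbors within the same view, and that each exocentric clip is linked to the egocentric clip at the same index. The function name `build_ego_exo_edges` and the parameter `temporal_window` are hypothetical; the abstract only states that several connection strategies are examined.

```python
from typing import List, Tuple


def build_ego_exo_edges(
    num_clips: int,
    num_exo_views: int,
    temporal_window: int = 1,  # hypothetical parameter: how far temporal edges reach
) -> Tuple[List[Tuple[int, int]], List[bool]]:
    """Build a sparse directed edge list over clip nodes.

    Node id = view * num_clips + t, where view 0 is the egocentric view.
    Returns the sorted edge list and a per-node flag marking egocentric nodes.
    """
    num_views = 1 + num_exo_views

    def node_id(view: int, t: int) -> int:
        return view * num_clips + t

    edges = set()

    # Temporal edges: connect each clip to nearby clips in the same view.
    for view in range(num_views):
        for t in range(num_clips):
            for dt in range(1, temporal_window + 1):
                if t + dt < num_clips:
                    a, b = node_id(view, t), node_id(view, t + dt)
                    edges.add((a, b))
                    edges.add((b, a))

    # Cross-view edges (assumed strategy): link each exocentric clip to the
    # egocentric clip at the same temporal index.
    for view in range(1, num_views):
        for t in range(num_clips):
            a, b = node_id(0, t), node_id(view, t)
            edges.add((a, b))
            edges.add((b, a))

    is_ego = [view == 0 for view in range(num_views) for _ in range(num_clips)]
    return sorted(edges), is_ego


# Training-time graph: 4 clips, ego view plus 2 exo views.
train_edges, train_is_ego = build_ego_exo_edges(num_clips=4, num_exo_views=2)

# Inference-time graph: egocentric clips only.
infer_edges, infer_is_ego = build_ego_exo_edges(num_clips=4, num_exo_views=0)
```

Under these assumptions the edge count grows linearly with the number of clips, which is consistent with the sparsity noted in the abstract. Node classification would then assign a keystep label to each node, with the egocentric-only graph used at inference.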

Paper Structure (21 sections, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Graph-based representation learning for keystep recognition. Learning leverages additional exocentric views via multi-view alignment. Inference is always over the egocentric view only.
  • Figure 2: There are four data modalities integrated into the graph framework, depending on the experiment group. Narrations and object classes are computed only for the egocentric view, and depth maps are computed for each view. There are between three and five exocentric views per scenario.
  • Figure 3: Illustration of the possible edge connection types within the graph framework. For narration and object class modalities, there is a single node added and connected to the ego vision node. Depth graphs can contain several edge types: within-modality edges, cross-modality, and temporal. Edges that connect matching node types share weights (e.g., $E_{\text{depth-vision}}$, $E_{\text{depth-depth}}$, $E_{\text{text-vision}}$).
  • Figure 4: Detic object detection on frames.
  • Figure 5: Corresponding egocentric and exocentric frame depth maps.
  • ...and 1 more figure
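
The weight sharing described in the Figure 3 caption can be illustrated by grouping edges according to the types of the nodes they connect, so that every edge in a group can use the same parameters. The sketch below is illustrative only: the helper `group_edges_by_type`, the node-type labels, and the toy graph are assumptions, and the actual layer used in the paper is not specified in this section.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_edges_by_type(
    node_types: List[str],
    edges: List[Tuple[int, int]],
) -> Dict[str, List[Tuple[int, int]]]:
    """Group directed edges by '<source type>-<destination type>'.

    Each group corresponds to one edge type (e.g. 'depth-vision'), so a
    single weight matrix could be shared by all edges in that group.
    """
    grouped: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for src, dst in edges:
        key = f"{node_types[src]}-{node_types[dst]}"
        grouped[key].append((src, dst))
    return dict(grouped)


# Toy graph (hypothetical): two vision nodes, two depth nodes, one text node.
node_types = ["vision", "vision", "depth", "depth", "text"]
edges = [(2, 0), (3, 1), (2, 3), (4, 0)]

edge_groups = group_edges_by_type(node_types, edges)
```

Here `edge_groups` has three keys, `depth-vision`, `depth-depth`, and `text-vision`, matching the edge-type names used in the caption.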