Generalized Event Partonomy Inference with Structured Hierarchical Predictive Learning

Zhou Chen, Joe Lin, Sathyanarayanan N. Aakur

TL;DR

PARSE learns multiscale event structure from streaming video without supervision, using a hierarchical predictive cascade whose layers exchange top-down context and bottom-up predictions. Prediction errors at each level drive the emergence of partonomies: boundaries are detected online from prediction-error dynamics, with no labeled segmentations. On Breakfast Actions, 50 Salads, and Assembly101, PARSE reports state-of-the-art performance among streaming methods and results competitive with offline baselines, which the authors take as support for predictive learning under uncertainty as a scalable path to human-like temporal abstraction. The work also introduces metrics for hierarchical structure (TED, hF1) and boundary detection (H-GEBD) that quantify compositional coherence and temporal precision, respectively.

Abstract

Humans naturally perceive continuous experience as a hierarchy of temporally nested events, with fine-grained actions embedded within coarser routines. Replicating this structure in computer vision requires models that can segment video not just retrospectively, but predictively and hierarchically. We introduce PARSE, a unified framework that learns multiscale event structure directly from streaming video without supervision. PARSE organizes perception into a hierarchy of recurrent predictors, each operating at its own temporal granularity: lower layers model short-term dynamics, while higher layers integrate longer-term context through attention-based feedback. Event boundaries emerge naturally as transient peaks in prediction error, yielding temporally coherent, nested partonomies that mirror the containment relations observed in human event perception. Evaluated on three benchmarks (Breakfast Actions, 50 Salads, and Assembly101), PARSE achieves state-of-the-art performance among streaming methods and rivals offline baselines in both temporal alignment (H-GEBD) and structural consistency (TED, hF1). The results demonstrate that predictive learning under uncertainty provides a scalable path toward human-like temporal abstraction and compositional event understanding.
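
The idea that boundaries appear as transient peaks in prediction error can be illustrated with a small streaming detector. This is a minimal sketch and not the paper's boundary rule, which this summary does not specify: the class name `OnlinePeakDetector`, the function `detect_boundaries`, the exponentially weighted running statistics, and the parameters `alpha`, `k`, and `warmup` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class OnlinePeakDetector:
    """Flags transient peaks in a streaming prediction-error signal.

    Hypothetical rule: an error is a boundary when it exceeds the running
    mean by more than k running standard deviations.
    """
    alpha: float = 0.1   # assumed smoothing factor for the running statistics
    k: float = 2.0       # assumed threshold, in running standard deviations
    warmup: int = 5      # steps to observe before any boundary may fire
    mean: float = 0.0
    var: float = 0.0
    steps: int = 0

    def update(self, error: float) -> bool:
        """Consume one error value and report whether it looks like a boundary."""
        if self.steps == 0:
            self.mean = error  # initialize the running mean on the first sample
        # Compare against statistics gathered before this step, so the
        # decision is causal and usable on a stream.
        threshold = self.mean + self.k * self.var ** 0.5
        is_peak = self.steps >= self.warmup and error > threshold
        # Exponentially weighted update of the running mean and variance.
        delta = error - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        self.steps += 1
        return is_peak


def detect_boundaries(errors):
    """Return the frame indices at which the detector fires."""
    detector = OnlinePeakDetector()
    return [t for t, e in enumerate(errors) if detector.update(e)]


# A flat error signal with a single spike at frame 20.
print(detect_boundaries([1.0] * 20 + [5.0] + [1.0] * 20))
```

Running one such detector per level of the hierarchy would give one boundary stream per temporal scale. With zero running variance the threshold collapses to the running mean, so a practical version would likely need a floor on the standard deviation.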

Paper Structure

This paper contains 26 sections, 21 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Hierarchical Event Partonomy. Real-world activities unfold across multiple temporal and semantic scales, from long-horizon tasks (L4) through goals (L3) and activities (L2) down to atomic actions (L1). Each higher level subsumes several temporally contained subevents, forming a structured partonomy.
  • Figure 2: PARSE Architecture Overview. At each timestep, the visual encoder extracts frame-level features ($f_t$) that feed into a hierarchy of recurrent predictors ($L_1, L_2, L_3, \ldots, L_N$), each modeling temporal dynamics at a distinct scale. Each predictor generates a forward estimate ($\hat{h}_t^{(i)}$) of the hidden state of the level below it ($h_t^{(i-1)}$) and computes a prediction error loss ($\mathcal{L}_{\text{pred}}^{(i)}$). Transient peaks in these hierarchical prediction errors (right) signal event transitions at progressively coarser temporal scales. Aggregating these peaks yields nested event boundaries, which form a partonomic structure capturing the coarse-to-fine organization of real-world activities.
  • Figure 3: Qualitative comparison of hierarchical partonomy predictions. We visualize the two best-matching levels (fine and coarse) from each predicted hierarchy for the "Making coffee" video from Breakfast Actions. PARSE produces temporally coherent and hierarchically consistent segments.
  • Figure 4: Ablation on dynamic hyperparameters. Effects of hidden-state size, top-down memory length, and sparsity regularization on boundary precision and recall at fine and coarse scales. Moderate configurations yield the most stable predictive dynamics.
  • Figure 5: Three-level partonomy predictions for simulated long videos for subjects P44 and P48 from the Breakfast Actions dataset.
  • ...and 2 more figures
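
The Figure 2 caption describes aggregating per-level error peaks into nested event boundaries that form a partonomy. The sketch below shows one way such a tree could be assembled from per-level boundary lists. It is an illustration under stated assumptions, not the authors' procedure: the helper names `split` and `build_partonomy`, the dictionary layout, and the choice to copy every coarse boundary down to all finer levels to guarantee containment are all hypothetical.

```python
def split(start, end, cuts):
    """Split the half-open span [start, end) at the interior cut points."""
    edges = [start] + [c for c in cuts if start < c < end] + [end]
    return list(zip(edges[:-1], edges[1:]))


def build_partonomy(levels, n_frames):
    """Build a nested event tree from per-level boundary frame indices.

    levels: boundary lists ordered fine -> coarse (level 1 first).
    Returns {"span": (start, end), "children": [...]} rooted at the whole video.
    """
    # Assumption: every coarse boundary is also treated as a boundary at all
    # finer levels, so each fine segment lies inside exactly one coarse segment.
    carried, coarse_to_fine = set(), []
    for bounds in reversed(levels):
        carried |= set(bounds)
        coarse_to_fine.append(sorted(carried))

    def expand(start, end, depth):
        node = {"span": (start, end), "children": []}
        if depth < len(coarse_to_fine):
            for s, e in split(start, end, coarse_to_fine[depth]):
                node["children"].append(expand(s, e, depth + 1))
        return node

    return expand(0, n_frames, 0)


# Two levels over a 40-frame clip: fine boundaries at 10, 20, 30; coarse at 20.
tree = build_partonomy([[10, 20, 30], [20]], n_frames=40)
for coarse in tree["children"]:
    print(coarse["span"], [fine["span"] for fine in coarse["children"]])
```

In this example the coarse level splits the clip into (0, 20) and (20, 40), and each coarse segment contains two fine segments. A coarse segment with no interior fine boundary simply gets a single child covering the same span.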