Generalized Event Partonomy Inference with Structured Hierarchical Predictive Learning

Zhou Chen; Joe Lin; Sathyanarayanan N. Aakur

TL;DR

PARSE learns multiscale event structure from streaming video without supervision, using a hierarchical predictive cascade in which lower layers predict short-term dynamics and higher layers feed back longer-term context. Prediction errors at each level drive the emergence of partonomies: boundaries are detected online from prediction-error dynamics, with no labeled segmentations. On Breakfast Actions, 50 Salads, and Assembly101, PARSE achieves state-of-the-art performance among streaming methods and is competitive with offline baselines, supporting predictive learning under uncertainty as a scalable path to human-like temporal abstraction. The work also introduces metrics for hierarchical structure (TED, hF1) and boundary detection (H-GEBD) that quantify both temporal precision and compositional coherence.

Abstract

Humans naturally perceive continuous experience as a hierarchy of temporally nested events, with fine-grained actions embedded within coarser routines. Replicating this structure in computer vision requires models that can segment video not just retrospectively, but predictively and hierarchically. We introduce PARSE, a unified framework that learns multiscale event structure directly from streaming video without supervision. PARSE organizes perception into a hierarchy of recurrent predictors, each operating at its own temporal granularity: lower layers model short-term dynamics, while higher layers integrate longer-term context through attention-based feedback. Event boundaries emerge naturally as transient peaks in prediction error, yielding temporally coherent, nested partonomies that mirror the containment relations observed in human event perception. Evaluated on three benchmarks (Breakfast Actions, 50 Salads, and Assembly101), PARSE achieves state-of-the-art performance among streaming methods and rivals offline baselines in both temporal alignment (H-GEBD) and structural consistency (TED, hF1). The results demonstrate that predictive learning under uncertainty provides a scalable path toward human-like temporal abstraction and compositional event understanding.
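
To make the idea of boundaries as transient peaks in prediction error more concrete, the following is a minimal sketch of an online detector for a single layer. It is not the authors' implementation: the class name `OnlineBoundaryDetector`, the running-statistics threshold rule, and the parameters `k`, `alpha`, `warmup`, and `min_std` are all assumptions chosen for illustration, and the per-frame prediction errors are assumed to come from some predictor that is not shown here.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class OnlineBoundaryDetector:
    """Illustrative only: flags a frame when its prediction error spikes
    above running statistics of the errors seen so far."""
    k: float = 2.0         # hypothetical threshold, in running standard deviations
    alpha: float = 0.1     # hypothetical smoothing factor for the running statistics
    warmup: int = 5        # frames to observe before any boundary may be flagged
    min_std: float = 0.05  # hypothetical floor so a flat signal does not give a zero threshold
    mean: float = 0.0
    var: float = 0.0
    count: int = 0
    boundaries: List[int] = field(default_factory=list)

    def update(self, error: float) -> bool:
        """Consume one per-frame prediction error; return True at a boundary."""
        t = self.count
        self.count += 1
        if t == 0:
            # Initialize the running mean from the first observation.
            self.mean = error
            return False
        std = max(math.sqrt(self.var), self.min_std)
        is_boundary = t >= self.warmup and error > self.mean + self.k * std
        # Update the statistics after the test, so a spike is judged
        # against the past rather than against itself.
        delta = error - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)
        if is_boundary:
            self.boundaries.append(t)
        return is_boundary


# Synthetic error trace: steady low error with one transient spike at frame 20.
detector = OnlineBoundaryDetector()
errors = [0.1] * 20 + [1.0] + [0.1] * 20
flags = [detector.update(e) for e in errors]
print(detector.boundaries)
```

In a hierarchical setting, one such detector per layer, each fed that layer's own prediction error, would be one plausible way to obtain nested boundaries at different temporal granularities; how PARSE actually couples its layers is described only at the level of the abstract above.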
