Hierarchical Action Learning for Weakly-Supervised Action Segmentation

Junxian Huang, Ruichu Cai, Hao Zhu, Juntao Fang, Boyan Xu, Weilin Chen, Zijian Li, Shenghua Gao

TL;DR

Experimental results on several benchmarks show that the Hierarchical Action Learning model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.

Abstract

Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that low-level visual and high-level action latent variables evolve at different rates: low-level visual variables change rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (HAL) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, in which high-level latent action variables govern the dynamics of low-level visual features. To model these varying timescales effectively, we introduce deterministic processes to align these latent variables over time. The HAL model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the HAL model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.
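The timescale argument in the abstract can be illustrated with a toy simulation. The sketch below is not the HAL model or the authors' data generation process; it is a minimal illustration under assumed dynamics, in which a hypothetical scalar action latent `c` switches rarely and a hypothetical scalar visual latent `v` tracks it with fresh noise at every step. The function names, the switching probability, and the mean absolute temporal difference used as a stand-in for a sparse transition penalty are all assumptions made for this example.

```python
import random


def simulate(T=200, switch_prob=0.02, noise=0.5, seed=0):
    """Toy two-level process: a slow action latent c_t drives a fast visual latent v_t."""
    rng = random.Random(seed)
    c, v = 0.0, 0.0
    actions, visuals = [], []
    for _ in range(T):
        # High level (assumed): the action latent changes only occasionally.
        if rng.random() < switch_prob:
            c = rng.choice([-2.0, 0.0, 2.0])
        # Low level (assumed): the visual latent follows c_t but is perturbed every step.
        v = 0.5 * v + 0.5 * c + rng.gauss(0.0, noise)
        actions.append(c)
        visuals.append(v)
    return actions, visuals


def transition_l1(z):
    """Mean absolute temporal difference, used here as a simple sparse-transition penalty."""
    return sum(abs(b - a) for a, b in zip(z, z[1:])) / (len(z) - 1)


actions, visuals = simulate()
print("action-level penalty:", transition_l1(actions))
print("visual-level penalty:", transition_l1(visuals))
```

In this toy setting the action-level sequence incurs a much smaller penalty than the visual-level sequence, which is the kind of asymmetry a sparsity constraint on high-level transitions is meant to exploit.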

Paper Structure (52 sections, 6 theorems, 41 equations, 10 figures, 7 tables)

Key Result

Lemma 1

(Block-wise Identification of $(\mathbf{v}_t, \mathbf{c}_t)$.) Suppose the observed, latent visual, and latent action variables follow the augmented data generation process in Figure 2(a). By matching the true joint distribution of five adjacent video frames, i.e., $\{\mathbf{x}_{t-2}, \ldots\}$ … Suppose that the learned $(\hat{g}, \hat{f}, p_{\hat{\epsilon}})$ to achieve Equation (equ:x_gen) - …

Figures (10)

  • Figure 1: An action segmentation example in the CrossTask dataset. (a) Real-world action videos exhibit different levels of representation, where action representations change more smoothly than visual representations. (b) A hierarchical data generation process of action videos, where high-level latent action variables govern the evolution of low-level visual variables. (c) Action segmentation results show that action-level segmentation better aligns with the ground truth than visual-level segmentation.
  • Figure 2: Illustration of the augmented data generation process and the framework of the HAL model. (a) The original data generation process is augmented by introducing pseudo-states and aligning the number of latent action variables with the number of visual variables. The dashed arrows denote the unknown deterministic transitions. (b) The overall framework of HAL consists of a pyramidal transformer-based backbone for feature extraction, visual and action encoders for latent visual and action variables, visual and action decoders for reconstruction, and the smoothness transition constraint that enforces the identification of latent action variables.
  • Figure 3: Intuitions of the theoretical results, where the solid and dashed arrows denote the stochastic and deterministic transitions, respectively. (a) The identification of $(\mathbf{v}_t, \mathbf{c}_t)$ can be achieved by leveraging five consecutive temporal observations. (b) By introducing the independent noises $\epsilon^v_{t-1}, \epsilon^v_{t}, \epsilon^v_{t+1}$, the stochastic transition processes $\mathbf{v}_{t-2} \rightarrow \mathbf{v}_{t-1}$ and $\mathbf{v}_{t} \rightarrow \mathbf{v}_{t+1}$ can be effectively transformed into corresponding deterministic transition processes.
  • Figure 4: Qualitative results of P04-cam01-P04-pancake on the Breakfast dataset.
  • Figure 5: t-SNE visualization of latent variables. (a) and (b) show scatter plots of the latent action and visual variables of the proposed method, respectively. (c) shows the output of ATBA after dimensionality reduction.
  • ...and 5 more figures

Theorems & Definitions (16)

  • Definition 1: Block-wise Identifiability of Latent Action $\mathbf{c}_t$ and Visual Variables $\mathbf{v}_t$ [von2021self]
  • Definition 2: Linear Operator [hu2008instrumental, dunford1988linear]
  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Lemma 2
  • Lemma 3
  • proof
  • Theorem 3
  • proof
  • ...and 6 more