Intention-Conditioned Long-Term Human Egocentric Action Forecasting

Esteve Valls Mascaro, Hyemin Ahn, Dongheui Lee

TL;DR

Intention-Conditioned Long-Term Human Egocentric Action Forecasting addresses the inherent uncertainty in predicting a sequence of future actions from egocentric video by leveraging a high-level intention as a guiding cue. The authors propose a two-module framework: a Hierarchical Multitask MLP Mixer (H3M) that extracts $N$ observed actions and the overall intention, and an Intention-Conditioned Variational Autoencoder (I-CVAE) that conditions future action generation on the inferred intention. The model produces $K$ stable predictions of the next $Z$ actions, and experiments on Ego4D show improved time-consistency and noun-level predictions, with ablations showing the value of intention conditioning. The work ranks first in CVPR@2022 and ECCV@2022 Ego4D LTA challenges and provides a practical pathway for intention-guided planning and human-robot collaboration.

Abstract

To anticipate how a human would act in the future, it is essential to understand the human intention, since it guides the human towards a certain goal. In this paper, we propose a hierarchical architecture which assumes that a sequence of human actions (low-level) can be derived from the human intention (high-level). Based on this, we address the Long-Term Action Anticipation task in egocentric videos. Our framework first extracts two levels of human information from the N observed video clips of human actions through a Hierarchical Multi-task MLP Mixer (H3M). Then, we condition the uncertainty of the future through an Intention-Conditioned Variational Auto-Encoder (I-CVAE) that generates K stable predictions of the next Z=20 actions that the observed human might perform. By leveraging human intention as high-level information, we claim that our model is able to anticipate more time-consistent actions in the long term, thus improving the results over baseline methods in the EGO4D Challenge. This work ranked first in both the CVPR@2022 and ECCV@2022 EGO4D LTA Challenges by providing more plausible anticipated sequences, improving the anticipation of nouns and overall actions. Webpage: https://evm7.github.io/icvae-page/

Paper Structure (16 sections, 1 equation, 6 figures, 4 tables)

This paper contains 16 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An example of the hierarchical structure of a human task. An egocentric sequence of videos of a human 'Working on milktea shop' (in purple, describing the high-level human intention) from Ego4D [ego4d]. In blue, the sequence of low-level action labels performed by the camera wearer is shown. This paper proposes a methodology that understands a human task based on this hierarchical structure. Our model extracts high-level human intention information and $N$ action labels from the observed sequence of $N$ clips (first row) to facilitate the anticipation of the $Z$ low-level actions that follow (second row).
  • Figure 2: Overall proposed framework. Provided pre-extracted features for $N=4$ observed videos are fed to our Hierarchical Multitask MLP Mixer model (H3M) to obtain low-level action labels and high-level intention. Results are fed into our Intention-Conditioned Variational AutoEncoder (I-CVAE) that anticipates subsequent $Z=20$ actions.
  • Figure 3: Detailed structure of the H3M architecture. First, pre-extracted padded features are fed into an Action MLP Mixer [mlpmixer] to obtain clip-level features (shown as green circles). These features are used (i) to obtain a verb-noun pair through a fully-connected pair (action head); (ii) to obtain a video representation through the Intention MLP Mixer, which is then classified as an intention class. The definition of the Mixer Layer is inherited from [mlpmixer].
  • Figure 4: Detailed structure of the I-CVAE architecture, illustrating the encoder (top) and decoder (bottom) of our Transformer-based CVAE model. Given a sequence of $N+Z$ actions and an intention label, the encoder outputs distribution parameters ($\hat{\mu}$ and $\hat{\Sigma}$) that encode all sequence information. Inspired by ACTOR [actor], extra learnable parameters per intention ($\mu$ and $\Sigma$) are used to obtain $\hat{\mu}$ and $\hat{\Sigma}$ and to sample the latent future action representation $z \in \mathbb{R}^{M}$, where $M$ is the latent dimension of the Transformer. The decoder takes the latent vector $z$, the $N$ observed actions and the intention $I$, and outputs the representation sequence of the $Z$ actions to anticipate. $I$ is used to select the learnable bias $b$. The Positional Encoder (PE) provides temporal-order information to the decoder. Finally, an Action Head composed of two fully-connected layers projects each action representation into a verb-noun pair.
  • Figure 5: Evaluation of I-CVAE trained with different numbers of observed actions $N$.
  • ...and 1 more figure
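
The sampling step described in the Figure 4 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a diagonal covariance parameterized by a log-variance, replaces the Transformer decoder with a single linear projection, and uses random arrays in place of learned weights. The names `sample_latents`, `toy_decode`, `bias_table`, and `head`, as well as all sizes except $Z=20$, are hypothetical. It only shows how $K$ latent vectors $z \in \mathbb{R}^{M}$ drawn from $(\hat{\mu}, \hat{\Sigma})$ lead to $K$ different sequences of $Z$ anticipated actions for the same intention.

```python
import numpy as np


def sample_latents(mu_hat, logvar_hat, k, rng):
    """Draw k latent vectors with the reparameterization trick.

    Assumes a diagonal covariance given as a log-variance vector:
    z = mu_hat + sigma * eps, with eps ~ N(0, I).
    """
    mu_hat = np.asarray(mu_hat, dtype=float)
    sigma = np.exp(0.5 * np.asarray(logvar_hat, dtype=float))
    eps = rng.standard_normal((k, mu_hat.shape[0]))
    return mu_hat + sigma * eps


def sinusoidal_pe(length, dim):
    """Standard sinusoidal positional encoding of shape (length, dim)."""
    pos = np.arange(length)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))


def toy_decode(z, intention, bias_table, head, num_future):
    """Hypothetical stand-in for the decoder: one query per future step.

    Each query is the positional encoding plus the latent z plus an
    intention-specific bias; a linear head maps it to class scores.
    """
    queries = sinusoidal_pe(num_future, z.shape[0]) + z + bias_table[intention]
    logits = queries @ head  # (num_future, num_classes)
    return logits.argmax(axis=-1)


rng = np.random.default_rng(0)
M, K, Z = 16, 5, 20                   # latent size and K are illustrative; Z=20 as in the paper
NUM_INTENTIONS, NUM_CLASSES = 3, 10   # illustrative sizes

mu_hat = rng.normal(size=M)           # placeholder for the encoder output
logvar_hat = np.full(M, -1.0)         # placeholder for the encoder output
bias_table = rng.normal(size=(NUM_INTENTIONS, M))  # placeholder for the learnable b per intention
head = rng.normal(size=(M, NUM_CLASSES))           # placeholder for the action head

latents = sample_latents(mu_hat, logvar_hat, K, rng)
predictions = np.stack(
    [toy_decode(z, 1, bias_table, head, Z) for z in latents]
)
print(predictions.shape)  # (5, 20): K candidate sequences of Z action ids
```

Each row of `predictions` is one candidate future; drawing several latents for a fixed intention is what lets a CVAE of this kind return $K$ alternatives rather than a single deterministic forecast.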