Intention Action Anticipation Model with Guide-Feedback Loop Mechanism

Zongnan Ma, Fuchun Zhang, Zhixiong Nan, Yao Ge

TL;DR

The paper proposes a Hierarchical Complete-Recent (HCR) information fusion model. It uses features of the entire video sequence (complete features) together with features of the video tail sequence (recent features) to explore the interrelationships between multiscale complete features and multiscale recent features.

Abstract

Anticipating human intention from videos has broad applications, such as automatic driving, robot assistive technology, and virtual reality. This study addresses the problem of intention action anticipation using egocentric video sequences to estimate actions that indicate human intention. We propose a Hierarchical Complete-Recent (HCR) information fusion model that makes full use of the features of the entire video sequence (i.e., complete features) and the features of the video tail sequence (i.e., recent features). The HCR model has two primary mechanisms. The Guide-Feedback Loop (GFL) mechanism is proposed to model the relation between one recent feature and one complete feature. Based on GFL, the MultiComplete-Recent Feature Aggregation (MCRFA) module is proposed to model the relation of one recent feature with multiscale complete features. Based on GFL and MCRFA, the HCR model can hierarchically explore the rich interrelationships between multiscale complete features and multiscale recent features. Through comparative and ablation experiments, we validate the effectiveness of our model on two well-known public datasets: EPIC-Kitchens and EGTEA Gaze+.

Paper Structure (15 sections, 16 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Visualization of intention action anticipation. Anticipation time $\tau$ is how much in advance the intention action has to be anticipated.
  • Figure 2: Overview of the Guide-Feedback Loop (GFL) mechanism. GFL comprises three stages: 1) the complete feature is updated to generate a global guiding feature ($\boldsymbol{GGF}$); 2) the global guiding feature guides the recent feature; 3) the guided recent feature feeds back to the updated complete feature.
  • Figure 3: Overview of MultiComplete-Recent Feature Aggregation (MCRFA) module. The MCRFA module models one recent feature with multiple complete features.
  • Figure 4: Hierarchical complete-recent fusion prediction. 'V' represents a verb, 'N' denotes a noun, and 'A' expresses an action.
  • Figure 5: Attention visualization of the complete and recent features. Red boxes indicate ground truth regions. "Initial" represents the initial features. "Single Attentioned" denotes the results for which only a single self-attention mechanism is used (i.e., Equation \ref{k1} converts to $\boldsymbol{K}_1 = Soft(Conv(\boldsymbol{C}_1)) * \boldsymbol{C}_1$). "Dual Attentioned" denotes the results generated by the dual self-attention mechanism (Equation \ref{k1}). "Guided" denotes the results of recent features guided by $\boldsymbol{GGF}$ (Equation \ref{k2}).
  • ...and 2 more figures
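
The single-attention form quoted in the Figure 5 caption, $\boldsymbol{K}_1 = Soft(Conv(\boldsymbol{C}_1)) * \boldsymbol{C}_1$, can be illustrated with a minimal sketch. This listing does not define the operators, so the sketch makes several assumptions: `Soft` is read as a softmax over the temporal axis, `Conv` as a 1x1 convolution that produces one score per time step, and `*` as element-wise scaling of each time step. The names `single_attention`, `w`, and `b` are hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_attention(C, w, b=0.0):
    """Sketch of K = Soft(Conv(C)) * C under the assumptions stated above.

    C: list of T feature vectors, each of length D (a complete feature).
    w: length-D weights of a hypothetical 1x1 convolution (one score per step).
    b: hypothetical bias of that convolution.
    """
    # Assumed "Conv": a 1x1 convolution reduces each time step to a scalar score.
    scores = [sum(wi * ci for wi, ci in zip(w, c)) + b for c in C]
    # Assumed "Soft": softmax across time steps, so the weights sum to 1.
    weights = softmax(scores)
    # Assumed "*": scale every feature vector by its attention weight.
    K = [[a * ci for ci in c] for a, c in zip(weights, C)]
    return K, weights


# Toy complete feature with T=3 time steps and D=2 channels.
C1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, 0.5]
K1, attn = single_attention(C1, w)
```

In this toy input the third time step receives the highest score, so it keeps the largest share of its magnitude in `K1`. The dual self-attention of the paper's Equation k1 is not reproduced here because its exact form does not appear in this listing.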