EMAG: Ego-motion Aware and Generalizable 2D Hand Forecasting from Egocentric Videos

Masashi Hatano, Ryo Hachiuma, Hideo Saito

TL;DR

EMAG addresses the challenge of forecasting 2D hand positions from egocentric videos under strong ego-motion and background biases. It introduces a Transformer-based architecture that explicitly models ego-motion as a sequence of homography matrices between consecutive frames and fuses trajectory, RGB, optical flow, and ego-motion modalities, with separate decoders for hand positions and ego-motion. On Ego4D and EPIC-Kitchens 55, evaluated with ADE/FDE metrics, EMAG outperforms prior methods by 1.7% and 7.0% on intra- and cross-dataset evaluations, respectively. The approach could benefit AR/VR and human-robot interaction by enabling more reliable anticipation of hand motion in diverse first-person scenarios.
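
The ADE/FDE metrics named above can be sketched as follows. This is a minimal illustration of the standard definitions (average displacement error over all future steps, final displacement error at the last step), not the paper's evaluation code. The function name `displacement_errors` and the array layout are assumptions, and details such as coordinate normalization or handling of missing hands are not given in this summary.

```python
import numpy as np


def displacement_errors(pred, gt):
    """Return (ADE, FDE) for trajectories shaped (T, 2) or (N, T, 2).

    Hypothetical helper: the array layout is an assumption for illustration.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    # Euclidean distance between predicted and true position at each step.
    dist = np.linalg.norm(pred - gt, axis=-1)
    ade = dist.mean()           # average over every future step (and sample)
    fde = dist[..., -1].mean()  # last future step only
    return float(ade), float(fde)


# Toy example: the prediction drifts upward by one unit per step.
gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
ade, fde = displacement_errors(pred, gt)
print(ade, fde)
```

Lower values are better for both metrics; FDE isolates the error at the end of the forecasting horizon, where drift is usually largest.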

Abstract

Predicting future human behavior from egocentric videos is a challenging but critical task for human intention understanding. Existing methods for forecasting 2D hand positions rely on visual representations and mainly focus on hand-object interactions. In this paper, we investigate the hand forecasting task and tackle two significant issues that persist in the existing methods: (1) 2D hand positions in future frames are severely affected by ego-motions in egocentric videos; (2) prediction based on visual information tends to overfit to background or scene textures, posing a challenge for generalization on novel scenes or human behaviors. To solve the aforementioned problems, we propose EMAG, an ego-motion-aware and generalizable 2D hand forecasting method. In response to the first problem, we propose a method that considers ego-motion, represented by a sequence of homography matrices of two consecutive frames. We further leverage modalities such as optical flow, trajectories of hands and interacting objects, and ego-motions, thereby alleviating the second issue. Extensive experiments on two large-scale egocentric video datasets, Ego4D and EPIC-Kitchens 55, verify the effectiveness of the proposed method. In particular, our model outperforms prior methods by 1.7% and 7.0% on intra- and cross-dataset evaluations, respectively. Project page: https://masashi-hatano.github.io/EMAG/
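
To make the ego-motion representation concrete, the sketch below estimates a homography between two consecutive frames from matched background points and stacks the results into a sequence. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how correspondences are obtained or whether a robust estimator is used, so a plain direct linear transform (DLT) on already-matched points stands in here. The names `estimate_homography`, `apply_homography`, and `homography_sequence` are hypothetical.

```python
import numpy as np


def estimate_homography(src, dst):
    """DLT estimate of H such that dst ~ H @ src, from at least 4 point pairs."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if src.shape != dst.shape or src.shape[0] < 4:
        raise ValueError("need at least 4 matched 2D points")
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the scale (assumes H[2, 2] is not near zero)


def apply_homography(H, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]


def homography_sequence(tracks):
    """Given tracked points shaped (T, N, 2), return T - 1 frame-to-frame matrices."""
    return np.stack([estimate_homography(tracks[t], tracks[t + 1])
                     for t in range(len(tracks) - 1)])


# Toy check: recover a known camera-induced warp from five matched points.
H_true = np.array([[1.0, 0.02, 5.0], [-0.01, 1.0, -3.0], [1e-4, 0.0, 1.0]])
frame0 = np.array([[0, 0], [100, 0], [100, 80], [0, 80], [50, 40]], dtype=float)
frame1 = apply_homography(H_true, frame0)
frame2 = apply_homography(H_true, frame1)
H_seq = homography_sequence(np.stack([frame0, frame1, frame2]))
print(H_seq.shape)
```

In practice the matched points would come from static background regions, so that the matrices capture camera motion rather than hand or object motion; a robust estimator (for example RANSAC) is a common choice when some matches are unreliable.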

Paper Structure (25 sections, 9 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: The presence of ego-motion in first-person videos significantly affects the dynamic movement of the camera position. Since the camera is part of the wearer's body, a variety of views can be captured even in a short period of time.
  • Figure 2: The architecture of the proposed method. Given input egocentric video frames, we pre-process them and obtain multiple modalities, including RGB and optical flow, detected bounding boxes of objects/hands, and homography matrices of adjacent frames. We train a single Transformer encoder and two Transformer decoders with MLP heads for hand and ego-motion prediction.
  • Figure 3: The accuracy drop comparison. The figure summarizes the accuracy drop percentage in the cross-dataset scenario from the accuracy in the intra-dataset scenario for each method. A lower value indicates that the performance does not drop by changing the scenario from intra-dataset to cross-dataset. We summarize the performance drop of the learning-based model as there is no performance degradation in non-learnable methods, such as CVM and KF.
  • Figure 4: Qualitative results. We present two sequences of predictions each from Ego4D and EPIC-Kitchens 55. Dots colored in green, red, blue, and yellow represent the hand positions of the ground truth, the proposed method, I3D + Regression, and OCT, respectively.
  • Figure 5: Scenario breakdown. The left pie chart represents the scenario breakdown on the validation set of the Ego4D dataset. There are eight categories in total, including inside/outside scenes. The right pie chart represents the scenario breakdown on the validation set of the EPIC-Kitchens 55 dataset. The EPIC-Kitchens 55 dataset contains only one category, cooking and activities in the kitchen.
  • ...and 1 more figure
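
The Figure 3 caption mentions non-learnable baselines (CVM and KF) and a cross-dataset accuracy drop percentage. The sketch below shows one plausible reading of both, under assumptions not confirmed by this summary: CVM is taken to be a constant velocity model that extrapolates the last observed displacement, and the drop is taken to be the relative increase in an error metric when moving from intra-dataset to cross-dataset evaluation. The function names and the numbers in the example are hypothetical, not results from the paper.

```python
import numpy as np


def constant_velocity_forecast(observed, horizon):
    """Extrapolate a (T, 2) trajectory by repeating its last per-step displacement."""
    observed = np.asarray(observed, dtype=float)
    if observed.shape[0] < 2:
        raise ValueError("need at least two observed positions")
    velocity = observed[-1] - observed[-2]           # assumed: last two frames only
    steps = np.arange(1, horizon + 1).reshape(-1, 1)  # 1, 2, ..., horizon
    return observed[-1] + steps * velocity


def relative_drop_percent(intra_error, cross_error):
    """Assumed definition: percentage increase in error from intra to cross-dataset."""
    return 100.0 * (cross_error - intra_error) / intra_error


# Toy example with made-up values.
future = constant_velocity_forecast([[0.0, 0.0], [1.0, 2.0]], horizon=2)
drop = relative_drop_percent(intra_error=0.10, cross_error=0.12)
print(future, drop)
```

Because a constant velocity model has no trained parameters, its error on a given test set is the same regardless of the training set, which is consistent with the caption's remark that non-learnable methods show no degradation.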