Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang

TL;DR

This work tackles the lack of fine-grained hand-object dynamics in egocentric video representation learning by introducing HOD, a data-generation pipeline that uses hand-object detectors and an LLM to produce rich, dynamics-infused captions. It then proposes EgoVideo, a ViT-based model with a lightweight motion adapter and a co-training scheme to efficiently learn these dynamics from high-framerate signals, achieving state-of-the-art results across multiple downstream tasks and showing strong generalization to robot manipulation. The approach demonstrates that enriching video-language pretraining with detailed hand-object interactions substantially improves performance in zero-shot and fine-tuned settings, with practical implications for embodied AI and assistive technologies. Code and data are publicly available, enabling broader adoption and further research into fine-grained egocentric understanding.

Abstract

In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature. However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dynamics between hands and objects. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. Since no suitable data is available, we introduce HOD, a novel pipeline employing a hand-object detector and a large language model to generate high-quality narrations with detailed descriptions of hand-object dynamics. To learn these fine-grained dynamics, we propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information. Through our co-training strategy, EgoVideo effectively and efficiently leverages the fine-grained hand-object dynamics in the HOD data. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple egocentric downstream tasks, including improvements of 6.3% in EK-100 multi-instance retrieval, 5.7% in EK-100 classification, and 16.3% in EGTEA classification in zero-shot settings. Furthermore, our model exhibits robust generalization capabilities in hand-object interaction and robot manipulation tasks. Code and data are available at https://github.com/OpenRobotLab/EgoHOD/.
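
The HOD idea described above (detector outputs combined with the original narration and passed to a language model) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Box` type, the `describe_motion` and `build_prompt` helpers, the pixel threshold, the prompt wording, and the sample narration are hypothetical and are not taken from the paper or its released code. The sketch only shows how per-frame box coordinates might be turned into coarse motion cues that an LLM could use to rewrite a narration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box in pixels: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float

    def center(self):
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


def describe_motion(track, min_shift=10.0):
    """Summarize a box track as a coarse direction from first to last frame.

    min_shift is a hypothetical pixel threshold below which motion is ignored.
    """
    (x0, y0), (x1, y1) = track[0].center(), track[-1].center()
    dx, dy = x1 - x0, y1 - y0
    parts = []
    if abs(dx) >= min_shift:
        parts.append("right" if dx > 0 else "left")
    if abs(dy) >= min_shift:
        # Image y grows downward, so a positive dy means the box moved down.
        parts.append("down" if dy > 0 else "up")
    return "moves " + " and ".join(parts) if parts else "stays roughly still"


def build_prompt(narration, tracks):
    """Combine the original narration with per-entity motion cues (assumed format)."""
    lines = [f"Original narration: {narration}", "Detected motion:"]
    for name, track in tracks.items():
        lines.append(f"- {name} {describe_motion(track)}")
    lines.append("Rewrite the narration so it describes these hand-object dynamics.")
    return "\n".join(lines)


# Made-up detections for a two-frame clip.
tracks = {
    "right hand": [Box(100, 200, 160, 260), Box(180, 200, 240, 260)],
    "cup": [Box(300, 220, 340, 280), Box(302, 221, 342, 281)],
}
prompt = build_prompt("C picks up a cup.", tracks)
print(prompt)
```

A real pipeline would presumably use richer motion descriptions (contact state, relative hand-object position, per-frame trajectories) and would send the prompt to an actual language model; this sketch stops at prompt construction.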

Paper Structure

This paper contains 21 sections, 4 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Left: Our EgoVideo model achieves state-of-the-art performance across multiple video benchmarks by learning fine-grained hand-object dynamics from videos. Right: Annotations from different sources: original Ego4D annotation (Grauman et al., 2022), LaViLa (Zhao et al., 2023), and our HOD. Our HOD annotations provide a detailed description of hand movements and object manipulation, demonstrating a higher level of detail and context.
  • Figure 2: Illustration of our HOD pipeline and EgoVideo model. In our Hand-Object Dynamics data generation pipeline (top), we first use a hand-object detector to obtain the spatial coordinates of hands and objects in the clip, then combine the motion information of hands and objects with the original narrations to generate semantically richer narrations. In our EgoVideo model (bottom), the backbone is trained on lower-framerate inputs, while a lightweight motion adapter learns fine-grained dynamics efficiently from higher-framerate inputs.
  • Figure 3: Normalized frequency of the top-30 words in EgoClip (green) and our HOD (blue). The HOD word distribution is less long-tailed, indicating greater word diversity.
  • Figure 4: Architecture of our motion adapter. We use a 2D convolution layer and a 1D temporal convolution layer to capture the spatial and temporal dynamics efficiently.
  • Figure 5: Examples of egocentric video/ego-like video and non-ego video.
  • ...and 4 more figures
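
The Figure 4 caption describes the motion adapter as a 2D convolution followed by a 1D temporal convolution. A minimal sketch of that factorized design, in plain Python on nested lists, might look as follows. This is an illustration under stated assumptions and not the authors' implementation: it uses a single channel, fixed rather than learned kernels, zero padding, and a residual connection back to the input, all of which are assumptions. The function names are hypothetical. As in most deep learning frameworks, "convolution" here means cross-correlation.

```python
def conv2d_same(frame, kernel):
    """Apply a 2D kernel to one H x W frame with zero padding (same output size)."""
    h, w = len(frame), len(frame[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, x = i + a - ph, j + b - pw
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[a][b] * frame[y][x]
            out[i][j] = acc
    return out


def conv1d_time_same(frames, kernel):
    """Apply a 1D kernel along the time axis at every pixel, zero padded."""
    t, h, w = len(frames), len(frames[0]), len(frames[0][0])
    p = len(kernel) // 2
    out = [[[0.0] * w for _ in range(h)] for _ in range(t)]
    for k in range(t):
        for a, weight in enumerate(kernel):
            src = k + a - p
            if 0 <= src < t:
                for i in range(h):
                    for j in range(w):
                        out[k][i][j] += weight * frames[src][i][j]
    return out


def motion_adapter(frames, spatial_kernel, temporal_kernel):
    """Spatial conv per frame, then temporal conv, added back to the input.

    The residual addition is an assumption borrowed from common adapter designs.
    """
    spatial = [conv2d_same(frame, spatial_kernel) for frame in frames]
    mixed = conv1d_time_same(spatial, temporal_kernel)
    return [
        [[x + m for x, m in zip(row, mixed_row)]
         for row, mixed_row in zip(frame, mixed_frame)]
        for frame, mixed_frame in zip(frames, mixed)
    ]


# A 3-frame, 1 x 3 clip in which a bright pixel moves one step right per frame.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
clip = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
adapted = motion_adapter(clip, identity, [-1, 0, 1])
print(adapted[1])
```

With the temporal kernel `[-1, 0, 1]`, the adapter adds a central difference over time to each frame, so the middle frame gains a negative response where the pixel was and a positive one where it is going. Setting the temporal kernel to all zeros makes the adapter return its input unchanged, which is one reason residual adapters are often initialized that way.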