EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining

Boshen Xu, Yuting Mei, Xinbi Liu, Sipeng Zheng, Qin Jin

TL;DR

EgoDTM tackles the lack of 3D understanding in egocentric video-language models by introducing a lightweight 3D-aware depth decoder guided by pseudo-depth from foundation models and a data construction pipeline that enriches captions with hand-object spatial cues. The method fuses dual transformer encoders with depth-aware pretraining and spatialized textual supervision (detect-track-generate) to learn depth-aware, text-aligned representations. Across zero-shot video-text retrieval, action recognition, depth estimation, and robotic manipulation tasks, EgoDTM shows consistent improvements over prior egocentric VLP methods, demonstrating stronger 3D-aware perception. The approach highlights the value of combining depth supervision, spatially enriched language, and foundation-model-driven data generation to advance 3D understanding in egocentric vision-and-language systems.

Abstract

Egocentric video-language pretraining has significantly advanced video representation learning. Humans perceive and interact with a fully 3D world, developing spatial awareness that extends beyond text-based understanding. However, most previous works learn from 1D text or 2D visual cues, such as bounding boxes, which inherently lack 3D understanding. To bridge this gap, we introduce EgoDTM, an Egocentric Depth- and Text-aware Model, jointly trained through large-scale 3D-aware video pretraining and video-text contrastive learning. EgoDTM incorporates a lightweight 3D-aware decoder to efficiently learn 3D-awareness from pseudo depth maps generated by depth estimation models. To further facilitate 3D-aware video pretraining, we enrich the original brief captions with hand-object visual cues by organically combining several foundation models. Extensive experiments demonstrate EgoDTM's superior performance across diverse downstream tasks, highlighting its stronger 3D-aware visual understanding. Code: https://github.com/xuboshen/EgoDTM.
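
The joint training described above can be illustrated with a minimal sketch. It combines a symmetric video-text contrastive term with a depth term that compares a predicted depth map against a teacher's pseudo depth map. The abstract does not specify the loss forms, so everything here is an assumption for illustration: the InfoNCE formulation, the L1 depth term, the `temperature` of 0.07, the `depth_weight`, and all function names are hypothetical and are not taken from the EgoDTM code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; the i-th video and i-th text form the positive pair.

    Assumption: a standard InfoNCE form, not necessarily the paper's exact objective.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # Cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], computed with a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    text_to_video = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(text_to_video))


def depth_l1(pred_depth, pseudo_depth):
    """Mean absolute error between a predicted depth map and the pseudo depth map.

    Assumption: plain L1 on same-sized 2D lists; the actual depth loss may differ.
    """
    diffs = [
        abs(p - q)
        for pred_row, pseudo_row in zip(pred_depth, pseudo_depth)
        for p, q in zip(pred_row, pseudo_row)
    ]
    return sum(diffs) / len(diffs)


def joint_loss(video_embs, text_embs, pred_depth, pseudo_depth, depth_weight=1.0):
    """Weighted sum of the contrastive and depth terms (depth_weight is hypothetical)."""
    contrastive = info_nce(video_embs, text_embs)
    return contrastive + depth_weight * depth_l1(pred_depth, pseudo_depth)
```

In this sketch the contrastive term pulls matching video and text embeddings together, while the depth term pushes the video encoder's features toward something a decoder can turn into depth. The two terms share the video encoder, which is the sense in which the pretraining is joint.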

Paper Structure

This paper contains 20 sections, 7 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: Comparison of egocentric pretraining paradigms. While previous paradigms focus on text-based learning [lin2022egovlp, pramanick2023egovlpv2, lavila] or 2D spatial region-aware learning [zhang2023helping], EgoDTM incorporates 3D spatial information to enhance video representations.
  • Figure 2: EgoDTM learns 3D-aware representations from depth and text. Our dual encoders are constructed using only transformers [dosovitskiy2020image, vaswani2017attention, clip] with flash attention [dao2022flashattention]. Pretraining has two components. (1) 3D-aware video pretraining: we design a lightweight 3D-aware decoder that predicts depth from visual feature maps, supervised by a teacher foundation model [depthanythingv2]. The decoder contains a plain feature pyramid that produces multi-scale features, a depth-aware transformer decoder that processes depth queries together with video features, and heads that predict depth maps. (2) Spatial-aware textual enrichment: we enhance captions with spatial information by organically combining foundation models in the detect-track-generate pipeline. Different green markers denote inconsistent HOI predictions; identical ones indicate consistent predictions.
  • Figure 3: Qualitative results of depth estimation from EgoDTM and DepthAnythingv2 (DAv2) [depthanythingv2] on the in-domain but unseen Ego4D validation set [grauman2022ego4d] and on out-of-domain, unseen data from EK100 [epic-kitchens-100], MECCANO [ragusa2021meccano], and H2O [Kwon2021h2o]. Note that DAv2 operates at a high resolution of 512p, while EgoDTM takes a lower-resolution 224p input and generates a depth map at a resolution of 56p. Despite the lower-resolution input, EgoDTM generalizes across unseen egocentric datasets with diverse environments, illuminations, backgrounds, and varying HOI object sizes.
  • Figure 4: Comparison of the noisy HOI bounding boxes (left) and the spatio-temporally consistent HOI masks (right).
  • Figure 5: Example of generalizable data construction. For clearer visualization, we blur the background to highlight HOI regions.
  • ...and 3 more figures