DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos

Lorenzo Mur-Labadia, Josechu Guerrero, Ruben Martinez-Cantin

TL;DR

Egocentric videos present dynamic wearer-object interactions and rapid camera motion that challenge static scene representations. The authors introduce DIV-FF, a triple-stream neural radiance field that decouples the persistent environment, dynamic elements, and the actor, while fusing image-language features (CLIP) and video-language features (EgoVideo) with time-aware components. Key contributions include a three-stream NeRF-like geometry model with frame-specific codes, pixel-aligned CLIP features aided by SAM masks, and a video-language feature field guided by local and global supervision to capture affordances and action semantics. Results on EPIC-Diff show substantial gains in dynamic object segmentation (+40.5%) and affordance segmentation (+69.7%), as well as amodal scene understanding. Overall, DIV-FF enables consistent semantic decomposition over time, supports novel-view synthesis of egocentric scenes, and advances interaction-aware perception for robotics, AR, and assistive technologies.

Abstract

Environment understanding in egocentric videos is an important step for applications like robotics, augmented reality and assistive technologies. These videos are characterized by dynamic interactions and a strong dependence on the wearer's engagement with the environment. Traditional approaches often focus on isolated clips or fail to integrate rich semantic and geometric information, limiting scene comprehension. We introduce Dynamic Image-Video Feature Fields (DIV-FF), a framework that decomposes the egocentric scene into persistent, dynamic, and actor-based components while integrating both image-language and video-language features. Our model enables detailed segmentation, captures affordances, understands the surroundings and maintains consistent understanding over time. DIV-FF outperforms state-of-the-art methods, particularly in dynamically evolving scenarios, demonstrating its potential to advance long-term, spatio-temporal scene understanding.

Paper Structure

This paper contains 12 sections, 1 equation, 9 figures, 2 tables.

Figures (9)

  • Figure 1: DIV-FF distills image-language and video-language features into a triple-stream feature field tailored to egocentric videos with numerous interactions and camera-wearer movements. Our approach achieves a deep understanding of the environment, supporting precise affordance segmentation, semantic scene decomposition and consistent segmentation of dynamic objects. With its implicit 3D representation, DIV-FF comprehends not just novel views but also surrounding areas.
  • Figure 2: Overview of DIV-FF. Our three-stream architecture predicts the color $c$, the density $\sigma$, the material aleatoric uncertainty $\beta$, the image-language features $\phi$ and the video-language features $\psi$ along a ray $r$ with direction $d$, given the camera viewpoint $g$ and a frame-specific code $z$. We first extract SAM masks and bounding boxes from the image, which we use to assign a single CLIP descriptor $\phi^{GT}$ to all the pixels within the respective mask. We supervise the video-language feature field with local patch features $\psi^{GT}(V_p)$ and a global video embedding $\psi^{GT}(V)$ assigned only to pixels in the interaction hotspot $\mathcal{M}_{IH}$, which is computed with a pre-trained hand-object detector.
  • Figure 3: Ablations on the image-language feature field. Treating the egocentric video as a dynamic scene enhances geometric reconstruction, while utilizing SAM masks further improves object segmentation accuracy.
  • Figure 4: DIV-FF image-language relevancy maps in novel views. The maps show dynamic object segmentation results for various text queries. Object contours are well defined because masks were used during training.
  • Figure 5: Consistent dynamic object segmentation across time steps in novel views. The dynamic and actor streams contain frame-specific codes $z^f_t$ and $z^a_t$, respectively. This time encoding is also propagated to the semantic feature field, yielding consistent segmentations despite the continuous movement of the "spatula" and "blue cutting board".
  • ...and 4 more figures
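
The mask-based supervision in the Figure 2 caption and the text-query relevancy maps in Figure 4 can be illustrated with a small toy sketch. This is not the authors' implementation. It assumes masks are given as sets of pixel coordinates (a stand-in for SAM output), that descriptors are short hand-written vectors (a stand-in for CLIP embeddings), and that relevancy is plain cosine similarity. The function names `paint_mask_features`, `cosine` and `relevancy_map` are hypothetical.

```python
import math


def paint_mask_features(height, width, masks, descriptors, dim):
    """Give every pixel inside a mask that mask's single descriptor.

    masks: list of sets of (row, col) pixels (hypothetical stand-in for SAM masks).
    descriptors: one vector per mask (hypothetical stand-in for CLIP embeddings).
    Pixels covered by no mask keep a zero vector; if masks overlap,
    the later mask overwrites the earlier one (a simplifying assumption).
    """
    feature_map = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for mask, descriptor in zip(masks, descriptors):
        for row, col in mask:
            feature_map[row][col] = list(descriptor)
    return feature_map


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def relevancy_map(feature_map, query):
    """Score each pixel feature against a query embedding."""
    return [[cosine(pixel, query) for pixel in row] for row in feature_map]


# Toy 2x2 image with two masks and 2-D descriptors.
masks = [{(0, 0), (0, 1)}, {(1, 1)}]
descriptors = [[1.0, 0.0], [0.0, 1.0]]
fmap = paint_mask_features(2, 2, masks, descriptors, dim=2)

# A query aligned with the first mask's descriptor.
rel = relevancy_map(fmap, query=[1.0, 0.0])
```

Because every pixel in a mask shares one descriptor, the relevancy score is constant inside each mask and changes only at mask boundaries, which is consistent with the well-defined contours noted in the Figure 4 caption. In the toy example, the two pixels of the first mask score 1.0 and all other pixels score 0.0.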