Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos

Daniele Materia, Francesco Ragusa, Giovanni Maria Farinella

Abstract

The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system in order to guide users during daily life activities and understand their short- and long-term goals. Creating systems with such capabilities requires addressing several complex challenges. This work addresses the problem of human-object interaction anticipation in Egocentric Vision using Vision Large Language Models (VLLMs). We tackle key limitations in existing approaches by improving visual grounding capabilities through Set-of-Mark prompting and understanding user intent via the trajectory formed by the user's most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches for the considered task, showing its model-agnostic nature.

Paper Structure

This paper contains 19 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Conceptual scheme of the VQA Interaction Anticipation task.
  • Figure 2: The proposed architecture for the human-object interaction anticipation task. It is composed of 4 main modules: 1) Set-of-Mark module, 2) Gaze module, 3) Sampling module, and 4) VLLM module.
  • Figure 3: The figure illustrates how the input frame is processed by the SoM and Gaze modules. The input RGB frame (a) is first processed by the SoM module (b) and the Gaze module (c). Their outputs are then fused (d) to obtain a visual representation that incorporates both spatial and intention-related information.
  • Figure 4: Visualization of our inverse exponential sampling strategy with $n=10$. Setting $\lambda=0$ (left) results in a uniform distribution, while $\lambda > 0$ (right) concentrates the sampled frames (the blue dots) on the instants immediately preceding the interaction. The blue dots represent the $n-1$ probabilistically sampled frames; the final frame, which is always part of the input, is omitted for clarity.
  • Figure 5: Qualitative example showing the positive impact of incorporating gaze information. (a) Classic VLLM, (b) proposed approach considering VLLM and SoM modules, (c) proposed approach considering VLLM and Gaze modules, (d) proposed approach considering VLLM, SoM and Gaze modules.
  • ...and 1 more figure
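The inverse exponential sampling strategy described for Figure 4 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact weighting function is not given here, so we assume sampling weights proportional to $\exp(\lambda \cdot t / T)$ over the clip's frame indices, which reduces to uniform sampling when $\lambda = 0$ and concentrates samples near the end of the clip (just before the interaction) when $\lambda > 0$. The last frame is always included, and the remaining $n-1$ frames are drawn probabilistically, matching the caption.

```python
import numpy as np

def inverse_exponential_sample(num_frames, n=10, lam=1.0, seed=None):
    """Sample n frame indices from a clip of num_frames frames.

    The final frame is always included; the other n-1 indices are drawn
    without replacement with weights growing exponentially toward the end
    of the clip. lam=0 yields uniform weights (assumed form, for
    illustration only: w_t proportional to exp(lam * t / T)).
    """
    rng = np.random.default_rng(seed)
    candidates = np.arange(num_frames - 1)            # all frames except the last
    weights = np.exp(lam * candidates / (num_frames - 1))
    probs = weights / weights.sum()
    sampled = rng.choice(candidates, size=n - 1, replace=False, p=probs)
    # Append the final frame, which is always part of the input
    return np.sort(np.append(sampled, num_frames - 1))

indices = inverse_exponential_sample(100, n=10, lam=2.0, seed=0)
```

With `lam > 0` the sampled indices cluster toward the end of the 100-frame clip, mirroring the right-hand panel of Figure 4; with `lam = 0` they spread uniformly, mirroring the left-hand panel.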