Gaze-Regularized VLMs for Ego-Centric Behavior Understanding

Anupam Pani, Yanchao Yang

Abstract

Eye gaze, encompassing fixations and saccades, provides critical insights into human intentions and future actions. This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike existing methods that rely solely on visual data and overlook gaze information, our approach incorporates gaze information directly into the VLM architecture during training. By generating gaze-based queries, the model dynamically focuses on gaze-highlighted regions, while a gaze-regularization mechanism aligns model attention with human attention patterns. To better understand how gaze can be effectively integrated into VLMs, we conducted extensive experiments exploring various strategies for incorporating gaze data. These innovations enable the prediction of future events with detailed action descriptions. Experimental results demonstrate a nearly 13% improvement in semantic scores compared to baseline models that do not use gaze data, highlighting the effectiveness of our approach. This work establishes a foundation for leveraging human gaze in VLMs, significantly boosting their predictive capabilities in applications requiring accurate and robust future event prediction.

Paper Structure (43 sections, 18 equations, 12 figures, 16 tables)

Figures (12)

  • Figure 1: Illustration of future event prediction. The input is a sequence of image frames, and the output is an action-related text prediction. Unlike the gaze-regularized model, the base model misidentifies the object to be picked up (bowl). Predicted future annotations for both models are shown on the right, with ground-truth annotations and immediate future frames displayed below for reference.
  • Figure 2: Overview of the architecture. A ViT encoder extracts image features, which are enhanced using gaze-based queries (obtained from gaze- or pseudo-gaze-overlaid images) in a gaze-regularized attention block. A Perceiver Resampler generates a fixed-size representation for the language module to predict text annotations. The gaze regularizer aligns model attention with human gaze patterns by minimizing the Kullback–Leibler divergence between the model attention distribution and gaze distribution during training.
  • Figure 3: Heatmap creation. Illustration of gaze data collection for generating gray-scale heatmaps and gaze-overlaid images. On the left, the aggregated gaze model incorporates multiple gaze points collected over the interval $[t-\delta, t]$ to generate the heatmap. On the right, the singular gaze model uses a single gaze point collected at time $t$. Both utilize Gaussian smoothing to generate the heatmap.
  • Figure 4: Future event prediction results for the base model (without gaze) and our gaze-regularized model, shown for an observation horizon ($\tau_o$) of 5 seconds. Past frames are omitted, but ground-truth annotations and future frames with a prediction duration ($\tau_a$) of 2 seconds are provided as references. Keywords from each set of annotations are highlighted for easier reading.
  • Figure 5: GPT-4V Iterative Prompting Workflow. The process begins with a sequence of images and an initial prompt, which are input into GPT-4V to generate annotations for each image. The user then evaluates these annotations and provides feedback, which is incorporated into the prompt using a language model. This modified prompt is used to refine the annotations in a continuous cycle. The objective is to improve the quality and relevance of the output with each iteration until the user is satisfied. For GPT-4V, a set of 10 images is provided at once, ensuring that the annotations maintain contextual coherence.
  • ...and 7 more figures
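
The captions for Figures 2 and 3 describe two concrete operations: building a Gaussian-smoothed gaze heatmap from one or several gaze points, and penalizing the Kullback–Leibler divergence between the model's attention distribution and the gaze distribution. The following is a minimal pure-Python sketch of both ideas, not the authors' implementation. The function names (`gaze_heatmap`, `kl_divergence`, `flatten`), the 8×8 grid, the value of `sigma`, and the direction KL(gaze ‖ attention) are all assumptions made for illustration; the captions do not specify them.

```python
import math


def gaze_heatmap(points, height, width, sigma=2.0):
    """Build a heatmap as a sum of isotropic Gaussians, one per gaze point.

    `points` is a list of (x, y) grid coordinates. One point corresponds to
    the singular gaze model; several points collected over [t - delta, t]
    correspond to the aggregated gaze model. The result is normalised to sum
    to 1 so it can be treated as a probability distribution.
    `sigma` is an illustrative choice, not a value taken from the paper.
    """
    heat = [[0.0] * width for _ in range(height)]
    for gx, gy in points:
        for y in range(height):
            for x in range(width):
                d2 = (x - gx) ** 2 + (y - gy) ** 2
                heat[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(sum(row) for row in heat)
    return [[v / total for v in row] for row in heat]


def flatten(grid):
    """Flatten a 2D grid into a 1D list in row-major order."""
    return [v for row in grid for v in row]


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two flattened distributions of equal length.

    Assumption: p is the gaze distribution and q is the model attention;
    the caption does not state which direction is used.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Singular gaze model: a single gaze point at time t.
single = gaze_heatmap([(4, 3)], height=8, width=8)

# Aggregated gaze model: several gaze points from the interval [t - delta, t].
aggregated = gaze_heatmap([(3, 3), (4, 3), (5, 4)], height=8, width=8)

# A uniform "attention" map stands in for an untrained model's attention.
uniform = [1.0 / 64] * 64

loss_uniform = kl_divergence(flatten(single), uniform)
loss_matched = kl_divergence(flatten(single), flatten(single))
```

In this toy setup the regularizer is positive when attention is uniform and zero when attention matches the gaze heatmap exactly, which is the behavior a gaze-alignment penalty needs. In an actual model the attention distribution would come from the attention block over ViT patch features rather than a hand-built list.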