Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability

Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar

TL;DR

Quantitative saliency metrics show that the model, trained only to predict a memorability score, exhibits spatial attention patterns similar to human gaze, especially for more memorable videos. Panoptic segmentation reveals that both the model and humans assign a greater share of attention to things, and a smaller share to stuff, relative to their occurrence probability.

Abstract

Understanding what makes a video memorable has important applications in advertising or education technology. Towards this goal, we investigate spatio-temporal attention mechanisms underlying video memorability. Unlike previous work that fuses multiple features, we adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction. We compare model attention against human gaze fixations collected through a small-scale eye-tracking study in which humans perform the video memory task. We uncover the following insights: (i) Quantitative saliency metrics show that our model, trained only to predict a memorability score, exhibits spatial attention patterns similar to human gaze, especially for more memorable videos. (ii) The model assigns greater importance to the initial frames of a video, mimicking human attention patterns. (iii) Panoptic segmentation reveals that both the model and humans assign a greater share of attention to things, and a smaller share to stuff, relative to their occurrence probability.
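
Insight (iii) compares, for each segmentation class, its share of attention with its share of pixels (its occurrence probability). A minimal sketch of that comparison for a single frame follows. It assumes a per-pixel class label map and an attention or gaze heatmap of the same size; the function name `class_shares`, the toy labels, and the normalization are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def class_shares(labels, weights):
    """Return ({class: pixel share}, {class: weight share}) for one frame.

    labels  -- 2D list of class names, one per pixel (hypothetical panoptic output)
    weights -- 2D list of non-negative attention or gaze values, same shape;
               assumed to have a positive total
    """
    pixel_count = defaultdict(float)
    weight_sum = defaultdict(float)
    for label_row, weight_row in zip(labels, weights):
        for label, w in zip(label_row, weight_row):
            pixel_count[label] += 1.0
            weight_sum[label] += w
    n_pixels = sum(pixel_count.values())
    total_w = sum(weight_sum.values())
    # Occurrence probability: fraction of pixels covered by each class.
    occurrence = {c: n / n_pixels for c, n in pixel_count.items()}
    # Attention share: fraction of total attention mass landing on each class.
    attention = {c: s / total_w for c, s in weight_sum.items()}
    return occurrence, attention

# Toy 2x4 frame: a "person" (thing) standing on "grass" (stuff).
labels = [["grass", "grass", "person", "grass"],
          ["grass", "grass", "person", "grass"]]
weights = [[0.05, 0.05, 0.30, 0.10],
           [0.05, 0.05, 0.30, 0.10]]

occurrence, attention = class_shares(labels, weights)
for c in occurrence:
    print(c, round(occurrence[c], 2), round(attention[c], 2))
```

In this toy frame the thing class covers a quarter of the pixels but receives more than half of the attention mass, which is the direction of the effect described in (iii). Aggregating over frames and videos would need an additional accumulation step that is not shown here.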

Paper Structure (60 sections, 8 equations, 16 figures, 7 tables)

Figures (16)

  • Figure 1: Comparing human gaze fixations (left) and the model's attention maps (right) for 3 different videos (one per row). The memorability scores, ground-truth (GT) and model prediction (PR), are provided on the left. The heatmaps depict areas of high visual attention through warmer colors (red-yellow), indicating regions where human observers fixated (left) and where the model attended (right). The model's attention patterns are aligned with human gaze patterns, especially for more memorable videos. Samples are from Memento10k (Newman et al., 2020).
  • Figure 2: Model overview. $T$ video frames are passed through an image backbone encoder to obtain spatio-temporal features $\mathbf{F} \in \mathbb{R}^{T \times H \times W \times D}$. Coupled with position embeddings, and after appending a $\mathsf{CLS}$ token, we pass them through a Transformer encoder with self-attention. A memorability score is computed from the $\mathsf{CLS}$ representation with an MLP. Attention scores between $\mathsf{CLS}$ and each token are used for downstream analysis.
  • Figure 3: Nearest neighbor (NN) analysis for videos from Memento10k (left) and VideoMem (right). We illustrate four validation set videos and, for each, four NNs from the training set. We provide the GT memorability score (below), the predicted score on the val set (above), and the average of the 4 NN scores from the training set. In B (right), multiple video clips with high visual similarity between train and validation sets are highlighted with a yellow background. Conversely, the green rows highlight clips that have similar content, but are likely from different source videos. We discuss how data leakage and variance in GT scores may adversely affect evaluation in \ref{subsec:videomem_challenging}.
  • Figure 4: Analysis of panoptic segmentation for the 40 most common classes (20 stuff, 20 things). Left: normalized pixel counts (blue), model attention-weighted counts (light blue), and human gaze-weighted counts (orange). Both the model and humans show lower affinity for stuff classes and higher affinity for thing classes, indicating the importance of things in memorability. Right: pixel counts accumulated across stuff and thing classes, highlighting the above trend clearly. Best viewed on screen with zoom.
  • Figure 5: Gaze vs. attention similarity metrics, with AUC-Judd scores on the Y-axis and ground truth on the X-axis. (See supplement \ref{subsec:metrics_and_complexity}, \ref{fig:metrics} for other metrics and their trends.) Left: Memento10k, Right: VideoMem. Error bars depict SEMs.
  • ...and 11 more figures
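
Figure 5 reports AUC-Judd, a saliency metric that treats fixated pixels as positives and sweeps thresholds taken from the saliency values at those pixels. The sketch below follows the commonly used formulation of this metric and is not taken from the paper. It assumes the model attention has already been resized to the frame resolution, and it omits the small random jitter that reference implementations usually add to break ties, so it is best read with tie-free inputs. The name `auc_judd` and the toy arrays are illustrative.

```python
def auc_judd(saliency, fixations):
    """AUC-Judd for one frame.

    saliency  -- 2D list of floats (e.g. model attention at frame resolution)
    fixations -- 2D list of 0/1 flags marking fixated pixels, same shape
    """
    sal = [v for row in saliency for v in row]
    fix = [f for row in fixations for f in row]
    n_pixels = len(sal)
    # Saliency values at fixated pixels act as the thresholds, highest first.
    thresholds = sorted((s for s, f in zip(sal, fix) if f), reverse=True)
    n_fix = len(thresholds)
    if n_fix == 0 or n_fix == n_pixels:
        raise ValueError("need at least one fixated and one non-fixated pixel")

    tpr, fpr = [0.0], [0.0]
    for i, t in enumerate(thresholds):
        above = sum(1 for s in sal if s >= t)  # pixels at or above the threshold
        tpr.append((i + 1) / n_fix)            # fixated pixels recovered so far
        fpr.append((above - (i + 1)) / (n_pixels - n_fix))
    tpr.append(1.0)
    fpr.append(1.0)

    # Trapezoidal area under the ROC curve.
    return sum((fpr[k + 1] - fpr[k]) * (tpr[k + 1] + tpr[k]) / 2
               for k in range(len(fpr) - 1))

# Toy 2x3 frame with two fixations.
saliency = [[0.9, 0.1, 0.2],
            [0.3, 0.8, 0.4]]
aligned = [[1, 0, 0],
           [0, 1, 0]]   # fixations on the two most salient pixels
partial = [[1, 0, 0],
           [1, 0, 0]]   # one fixation on a low-saliency pixel

score_aligned = auc_judd(saliency, aligned)
score_partial = auc_judd(saliency, partial)
print(score_aligned, score_partial)
```

A score near 1 means fixated pixels rank above almost all other pixels in the attention map. Because the curve is interpolated between only as many thresholds as there are fixations, the value is coarse when a frame has few fixations.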