Infinite Gaze Generation for Videos with Autoregressive Diffusion

Jenna Kang, Colin Groth, Tong Wu, Finley Torrens, Patsorn Sangkloy, Gordon Wetzstein, Qi Sun

Abstract

Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows (≈ 3-5 s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms prior methods in long-range spatio-temporal accuracy and trajectory realism.
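To make the autoregressive sampling idea concrete, the sketch below shows one way a diffusion sampler can extend a raw gaze trajectory window by window, feeding each predicted window back as the history for the next one. This is a minimal illustration, not the paper's released code: the window size, history length, number of denoising steps, and the placeholder denoiser are all assumptions made for readability.

```python
import numpy as np

# Illustrative sketch of windowed autoregressive diffusion sampling for gaze.
# A "gaze sample" is (x, y, t): normalized screen coordinates plus a timestamp.
# The denoiser below is a stand-in; in the paper it would be a U-Net conditioned
# on saliency-aware video latents via cross-attention.

WINDOW = 32          # gaze samples predicted per autoregressive step (assumed)
HISTORY = 32         # gaze samples of history fed back as conditioning (assumed)
DIFF_STEPS = 50      # reverse-diffusion steps per window (assumed)

def denoise(noisy_window, history, video_latents, step):
    """Placeholder denoiser: pulls noisy samples toward the last history point.
    A real model would predict the clean window from (noise, history, latents)."""
    anchor = history[-1]
    alpha = 1.0 - step / DIFF_STEPS
    return alpha * noisy_window + (1.0 - alpha) * anchor

def sample_window(history, video_latents, rng):
    """Run a toy reverse-diffusion process for one future window."""
    x = rng.standard_normal((WINDOW, 3))          # start from pure noise
    for step in reversed(range(DIFF_STEPS)):
        x = denoise(x, history, video_latents, step)
    return x

def generate_gaze(num_windows, video_latents_per_window, seed=0):
    """Autoregressive rollout: each predicted window becomes the next history."""
    rng = np.random.default_rng(seed)
    history = np.tile(np.array([0.5, 0.5, 0.0]), (HISTORY, 1))  # start at center
    trajectory = []
    for w in range(num_windows):
        window = sample_window(history, video_latents_per_window[w], rng)
        trajectory.append(window)
        history = window[-HISTORY:]               # feed predictions back in
    return np.concatenate(trajectory, axis=0)

if __name__ == "__main__":
    latents = [None] * 4                          # stand-in for video latents
    gaze = generate_gaze(num_windows=4, video_latents_per_window=latents)
    print(gaze.shape)                             # (128, 3): x, y, timestamp
```

Because each step only consumes a fixed-length history rather than the full past, the rollout can in principle continue for videos of arbitrary length, which is the property the abstract refers to as infinite-horizon prediction.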

Paper Structure

This paper contains 29 sections, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Our training and inference pipelines. The input to each iteration consists of an n-dimensional vector holding the gaze history (ground truth during training) and noise for the future prediction, together with the video frames corresponding to the prediction window. During inference, the gaze history in this input vector is filled with earlier predictions in an autoregressive manner. Video latents are generated frame by frame with a modified version of UNISAL; the blue boxes indicate our modifications, while the orange boxes indicate the original architecture. The low-dimensional video latents are processed by the U-Net with cross-attention at each block (a conditioning sketch follows this list).
  • Figure 2: Comparison of video gaze-trajectory predictions across different methods. The image shows a representative frame from the video, but each trajectory is generated for the entire duration of the video. Generated trajectories (color map indicates timestamps) are overlaid on the ground-truth human gaze (black trajectories, also shown with the color map in the left column).
  • Figure 3: Temporal progression of video gaze predictions with different methods. Generated trajectories are overlaid on the ground-truth human gaze (black). Note that our method's trajectories stay closer to the ground truth than those of the alternative approaches.
  • Figure 4: User study results. Each row corresponds to one video. Our model (blue) consistently receives more votes than TPP-Gaze (green).
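The cross-attention conditioning described in the Figure 1 caption, where gaze tokens attend to low-dimensional video latents inside each denoiser block, can be sketched as follows. The token dimensions, number of heads, and block layout are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the conditioning pattern described in Figure 1:
# gaze tokens (history + noised future samples) attend to low-dimensional
# video latents through cross-attention inside each denoiser block.
# Dimensions and the latent encoder are assumptions, not the paper's values.

class CrossAttentionBlock(nn.Module):
    def __init__(self, gaze_dim=128, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(gaze_dim)
        self.attn = nn.MultiheadAttention(gaze_dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(gaze_dim, 4 * gaze_dim), nn.GELU(),
            nn.Linear(4 * gaze_dim, gaze_dim),
        )

    def forward(self, gaze_tokens, video_latents):
        # Queries come from the gaze sequence; keys/values from video latents.
        q = self.norm(gaze_tokens)
        attended, _ = self.attn(q, video_latents, video_latents)
        x = gaze_tokens + attended
        return x + self.mlp(x)

if __name__ == "__main__":
    batch, n_gaze, n_latents, dim = 2, 64, 16, 128
    gaze_tokens = torch.randn(batch, n_gaze, dim)       # history + noised future
    video_latents = torch.randn(batch, n_latents, dim)  # per-frame video latents
    block = CrossAttentionBlock()
    out = block(gaze_tokens, video_latents)
    print(out.shape)  # torch.Size([2, 64, 128])
```

Keeping the video latents on the key/value side lets the same block handle prediction windows of different lengths, since only the gaze-token sequence length changes between training and autoregressive inference.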