Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following

Qiaomu Miao, Alexandros Graikos, Jingwei Zhang, Sounak Mondal, Minh Hoai, Dimitris Samaras

TL;DR

This work tackles the costly and ambiguous process of annotating gaze targets by proposing the first semi-supervised gaze-following framework, which fuses two priors: Grad-CAM heatmaps obtained by prompting a pretrained VQA model, and an annotation prior learned by a diffusion model trained on labeled data. The Grad-CAM heatmaps offer strong guidance but are noisy; the diffusion refinement mitigates this by producing pseudo-labels aligned with the training data distribution. The proposed Grad-CAM Diffusion Refinement (GCDR) method, including a Mean Teacher variant, yields consistent improvements over baselines on GazeFollow and VideoAttentionTarget, achieving notable annotation savings (e.g., 50% fewer labels) while maintaining or surpassing fully supervised performance in many settings. This approach enables scalable gaze-following models for both images and videos and suggests a general path for refining priors derived from vision-language models into reliable supervision for semi-supervised tasks.

Abstract

Training gaze following models requires a large number of images with gaze target coordinates annotated by human annotators, which is a laborious and inherently ambiguous process. We propose the first semi-supervised method for gaze following by introducing two novel priors to the task. We obtain the first prior using a large pretrained Visual Question Answering (VQA) model, where we compute Grad-CAM heatmaps by "prompting" the VQA model with a gaze following question. These heatmaps can be noisy and are not suited for use in training. The need to refine these noisy annotations leads us to incorporate a second prior. We utilize a diffusion model trained on limited human annotations and modify the reverse sampling process to refine the Grad-CAM heatmaps. By tuning the diffusion process, we achieve a trade-off between the human annotation prior and the VQA heatmap prior, which retains the useful VQA prior information while exhibiting properties similar to those of the training data distribution. Our method outperforms simple pseudo-annotation generation baselines on the GazeFollow image dataset. More importantly, our pseudo-annotation strategy, applied to a widely used supervised gaze following model (VAT), reduces the annotation need by 50%. Our method also performs best on the VideoAttentionTarget dataset.
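As a rough illustration of the first prior, the sketch below shows the generic Grad-CAM computation on a stack of feature maps: each map is weighted by its spatially averaged gradient, the weighted maps are summed, and negative values are clipped. It assumes the activations and gradients have already been extracted from some layer of a VQA model; the function name `grad_cam`, the array shapes, and the random inputs are hypothetical, and the layer the authors actually use is not specified in this summary.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Generic Grad-CAM from feature maps and their gradients.

    activations, gradients: arrays of shape (K, H, W), where gradients holds
    d(score)/d(activations) for the answer score of interest.
    Returns an (H, W) heatmap scaled to [0, 1].
    """
    # Channel weights: spatial average of the gradients, shape (K,).
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum over channels, shape (H, W).
    cam = np.tensordot(weights, activations, axes=1)
    # Keep only positive evidence for the answer.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Placeholder inputs standing in for real VQA features (assumption).
rng = np.random.default_rng(0)
acts = rng.random((8, 16, 16))
grads = rng.standard_normal((8, 16, 16))
heatmap = grad_cam(acts, grads)
```

In practice the resulting map would be resized to the image resolution before being used as a noisy gaze prior.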

Paper Structure (29 sections, 5 equations, 9 figures, 11 tables)

Figures (9)

  • Figure 1: (a) Gaze following annotation challenges. Annotating gaze is a laborious task with inherent ambiguities. (b) Pseudo annotations for gaze following. We generate pseudo annotations by first computing Grad-CAM heatmaps from a pre-trained VQA model, and then refining the noisy heatmaps with a diffusion model.
  • Figure 2: (a) Overall pipeline. We compute Grad-CAM heatmaps for unlabeled images and train the diffusion model with a small human-labeled set (or with unlabeled images using Mean Teacher). The diffusion model refines the Grad-CAM heatmaps into pseudo-annotations. Both the pseudo-annotations and the human-labeled set are used to train a gaze following model. (b) Grad-CAM heatmap generation. Given an image with an overlaid person bounding box, we "prompt" a pretrained VQA model with a gaze question and compute the Grad-CAM heatmap from the answer. (c) Grad-CAM refinement. We perturb the Grad-CAM heatmaps with Gaussian noise and pass them through the reverse diffusion process to generate the refined pseudo-annotations.
  • Figure 3: Diffusion model training and refinement. (a) The diffusion model is trained on supervised data with noise added to the ground truth heatmap at random time steps. (b) During refinement, we add noise at a specific time step to the Grad-CAM heatmap. We treat this heatmap as an intermediate step input during the reverse process. Heatmaps are overlaid on the original images for illustration purposes. The conditional feature extraction for the diffusion model is omitted for simplicity.
  • Figure 4: Visualizations of pseudo heatmaps generated by different teachers. Our method generates the cleanest pseudo annotations while retaining the Grad-CAM heatmap priors (Rows 1–3). When the initial Grad-CAM heatmap responds strongly to unlikely locations or is completely noisy, our method can also ignore it (Row 4).
  • Figure 5: Diffusion model output for noise added at different timesteps. Red dots represent the ground truth annotation. Adding noise at earlier steps generates outputs on high Grad-CAM response regions. Adding noise at later steps generates outputs similar to sampling from pure noise. Noise at step 250 is the best trade-off.
  • ...and 4 more figures
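
The refinement described in the captions for Figures 3 and 5 can be sketched as follows: noise is added to the Grad-CAM heatmap at a chosen step, and the noised heatmap is then treated as an intermediate state of a standard DDPM reverse process. This is a minimal sketch under stated assumptions, not the authors' implementation: the linear noise schedule, the total of 1000 steps, the function names, and the `predict_noise` placeholder (which stands in for the trained, image-conditioned diffusion model) are all hypothetical. Only the starting step of 250 comes from the Figure 5 caption.

```python
import numpy as np


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Assumed linear noise schedule; the paper's schedule may differ."""
    return np.linspace(beta_start, beta_end, num_steps)


def noise_heatmap(heatmap, t_start, alphas_cumprod, rng):
    """Forward-diffuse a heatmap directly to step t_start."""
    a_bar = alphas_cumprod[t_start]
    eps = rng.standard_normal(heatmap.shape)
    return np.sqrt(a_bar) * heatmap + np.sqrt(1.0 - a_bar) * eps


def refine(heatmap, t_start, betas, predict_noise, rng):
    """Noise the heatmap at t_start, then run DDPM reverse steps down to 0."""
    alphas = 1.0 - betas
    alphas_cumprod = np.cumprod(alphas)
    x = noise_heatmap(heatmap, t_start, alphas_cumprod, rng)
    for t in range(t_start, -1, -1):
        eps_hat = predict_noise(x, t)
        # Posterior mean of the standard DDPM reverse step.
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_cumprod[t]) * eps_hat)
        mean = mean / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = mean + np.sqrt(betas[t]) * noise
    return x


def dummy_predict_noise(x, t):
    """Placeholder denoiser (assumption); a trained model would go here."""
    return np.zeros_like(x)


rng = np.random.default_rng(0)
betas = linear_beta_schedule()
gradcam = rng.random((64, 64))  # stand-in for a real Grad-CAM heatmap
refined = refine(gradcam, t_start=250, betas=betas,
                 predict_noise=dummy_predict_noise, rng=rng)
```

The choice of `t_start` controls the trade-off noted in Figure 5: a small value keeps the output close to the Grad-CAM response, while a large value discards most of it and approaches sampling from pure noise.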