
AL-GTD: Deep Active Learning for Gaze Target Detection

Francesco Tonini, Nicola Dall'Asen, Lorenzo Vaquero, Cigdem Beyan, Elisa Ricci

TL;DR

AL-GTD integrates supervised and self-supervised losses within a novel sample acquisition function to perform active learning (AL). It outperforms AL competitors and also surpasses SOTA gaze target detectors when all are trained in a low-data regime.

Abstract

Gaze target detection aims at determining the image location where a person is looking. While existing studies have made significant progress in this area by regressing accurate gaze heatmaps, these achievements have largely relied on access to extensive labeled datasets, which demands substantial human labor. In this paper, our goal is to reduce the reliance on the size of labeled training data for gaze target detection. To achieve this, we propose AL-GTD, an innovative approach that integrates supervised and self-supervised losses within a novel sample acquisition function to perform active learning (AL). Additionally, it utilizes pseudo-labeling to mitigate distribution shifts during the training phase. AL-GTD achieves the best of all AUC results by utilizing only 40-50% of the training data, in contrast to state-of-the-art (SOTA) gaze target detectors requiring the entire training dataset to achieve the same performance. Importantly, AL-GTD quickly reaches satisfactory performance with 10-20% of the training data, showing the effectiveness of our acquisition function, which is able to acquire the most informative samples. We provide a comprehensive experimental analysis by adapting several AL methods for the task. AL-GTD outperforms AL competitors, simultaneously exhibiting superior performance compared to SOTA gaze target detectors when all are trained within a low-data regime. Code is available at https://github.com/francescotonini/al-gtd.

Paper Structure (23 sections, 8 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1: The performance of our AL-GTD compared to the counterpart active learning approach AL-SSL [elezi2022not] and SOTA gaze target detectors: Tu et al. [tu2022end] and Tonini et al. [Tonini_2023_ICCV, tonini2022multimodal] on the GazeFollow dataset [Recasens2017]. Our method consistently performs better than competitors and achieves SOTA AUC performance with half of the training data, proving the effectiveness of our acquisition function.
  • Figure 2: The illustration and the pseudocode of our AL-GTD. We begin by obtaining the augmented version $\mathbb{A}$ from the original version $\mathcal{I}$ of unlabeled samples $U_i$ at the current AL cycle $i$. $\mathcal{OD}$ extracts relevant objects $O$ in the scene from $I_{RGB}$, while $GTN$ processes both $I_{RGB}$ and $I_{D}$ of the scene and the crop of the head of the person of interest $I_{H}$. From $GTN$, the attention map $M_A$ and gaze heatmap $H_G$ are obtained. The outputs of $\mathcal{OD}$ and $GTN$ are used to build the acquisition function (Eq. \ref{eq:score}), composed of the objectness $\Gamma$, the scatteredness $\Sigma$, and the discrepancy $\Delta$ scores. The oracle annotates the most informative samples, while those with the lowest scatteredness (Eq. \ref{eq:pse}) are pseudo-labeled. Both the samples manually labeled by the oracle and the pseudo-labeled samples are added to the pool $L_{i+1}$, and $GTN$ is trained on the updated set. This process is repeated for a fixed number of iterations $N$ until the exhaustion of the labeling budget $\beta$.
  • Figure 3: Our proposed $GTN$. $\mathcal{S}$ and $\mathcal{D}$ process the scene RGB image $I_{RGB}$ and depth map $I_{D}$, respectively. The crop of the head of the person of interest $I_{H}$ is processed by a separate head branch $\mathcal{H}$, and $D_{A}$ projects the head features into the attention map $M_A$. The scene $f_{S}$ and depth $f_{D}$ features are multiplied by the attention map $M_A$ and processed by two separate encoders, $\mathcal{E}_S$ and $\mathcal{E}_D$, along with the head features $f_{H}$. Finally, the decoder $D_{H}$ processes the features of the encoders and generates the gaze heatmap $H_G$. To alleviate prediction inconsistency, we train on both the original version $\mathcal{I}$ and the augmented version $\mathbb{A}$ of each labeled sample.
  • Figure 4: Comparisons among AL methods on the GazeFollow [Recasens2015] dataset. Left: Area Under the Curve (AUC) of the predicted gaze heatmap w.r.t. the ground truth (GT). Center and right: average and minimum distance between GT and the predicted gaze point. Our method, AL-GTD, consistently surpasses random sampling and other AL methods, demonstrating superior performance even with a small initial training dataset (3.7K samples, $\sim$3% of the original train split).
  • Figure 5: Comparisons among AL methods on the VideoAttentionTarget [chong2020detecting] dataset. Left: Area Under the Curve (AUC) of the predicted gaze heatmap w.r.t. the GT. Right: average distance between GT and predicted gaze point.
  • ...and 1 more figure
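
The active learning cycle described in the Figure 2 caption can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the `score_fn` and `train_fn` hooks, and the per-cycle parameters are all hypothetical. The weighted sum in `combine_scores` is only a placeholder, since the paper's actual score equation is not reproduced in this summary. The sketch also assumes that a higher acquisition score means a more informative sample, and that pseudo-labels are recomputed at every cycle rather than kept permanently.

```python
def combine_scores(objectness, scatteredness, discrepancy, weights=(1.0, 1.0, 1.0)):
    """Placeholder acquisition score built from the three per-sample terms.

    ASSUMPTION: a weighted sum stands in for the paper's score equation,
    which is not shown in this summary.
    """
    w_o, w_s, w_d = weights
    return [w_o * o + w_s * s + w_d * d
            for o, s, d in zip(objectness, scatteredness, discrepancy)]


def select_samples(acq_scores, scatteredness, n_oracle, n_pseudo):
    """Return (oracle positions, pseudo-label positions) for one cycle."""
    positions = range(len(acq_scores))
    # Highest acquisition score first (assumed: higher = more informative).
    oracle = sorted(positions, key=lambda p: -acq_scores[p])[:n_oracle]
    chosen = set(oracle)
    rest = [p for p in positions if p not in chosen]
    # Among the remaining samples, the least scattered are pseudo-labeled.
    pseudo = sorted(rest, key=lambda p: scatteredness[p])[:n_pseudo]
    return oracle, pseudo


def run_active_learning(pool, budget, per_cycle, n_pseudo, score_fn, train_fn):
    """Score, select, label, and retrain until the labeling budget is spent.

    score_fn(samples) -> (acq_scores, scatteredness)   # hypothetical hook
    train_fn(labeled, pseudo) retrains the model        # hypothetical hook
    """
    if per_cycle <= 0:
        raise ValueError("per_cycle must be positive")
    unlabeled = list(pool)
    labeled = []
    while budget > 0 and unlabeled:
        acq, scat = score_fn(unlabeled)
        k = min(per_cycle, budget, len(unlabeled))
        oracle_pos, pseudo_pos = select_samples(acq, scat, k, n_pseudo)
        labeled += [unlabeled[p] for p in oracle_pos]   # oracle annotations
        pseudo = [unlabeled[p] for p in pseudo_pos]     # model's own labels
        train_fn(labeled, pseudo)
        picked = set(oracle_pos)
        # Only oracle-labeled samples leave the unlabeled pool (assumption).
        unlabeled = [s for p, s in enumerate(unlabeled) if p not in picked]
        budget -= k
    return labeled
```

In practice, `score_fn` would run the object detector and the gaze network on each unlabeled sample and its augmented version, and `train_fn` would retrain the network on the union of oracle-labeled and pseudo-labeled samples.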