Vision Large Language Models Are Good Noise Handlers in Engagement Analysis

Alexander Vedernikov, Puneet Kumar, Haoyu Chen, Tapio Seppänen, Xiaobai Li

TL;DR

This work tackles subjective, noisy engagement labels in video data by using Vision-Language Models to refine annotations via a structured questionnaire and to partition data into Accepted and Rejected subsets. A two-stage curriculum learning framework with soft label refinement then guides training, allowing models to progressively learn from ambiguous samples while incorporating uncertainty. Empirical results across EngageNet, DREAMS, and PAFE show modest but robust improvements, highlighting increased resilience to label noise and annotation subjectivity. The approach offers a practical path to more reliable affective computing in engagement analysis, with potential for broader adoption in datasets with subjective labels.

Abstract

Unlike traditional image classification, engagement recognition in video datasets is hampered by subjective and noisy labels, which limit model performance. To address this, we propose a framework that leverages Vision Large Language Models (VLMs) to refine annotations and guide the training process. Our framework uses a questionnaire to extract behavioral cues and split data into high- and low-reliability subsets. We also introduce a training strategy combining curriculum learning with soft label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect uncertainty. We demonstrate that classical computer vision models trained on refined high-reliability subsets and enhanced with our curriculum strategy show improvements, highlighting the benefits of addressing label subjectivity with VLMs. The method surpasses the prior state of the art on engagement benchmarks: on EngageNet in three of six feature settings (maximum improvement of +1.21%), and on DREAMS and PAFE with F1 gains of +0.22 and +0.06, respectively.
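
The split into high- and low-reliability subsets and the soft labels for ambiguous samples can be illustrated with a minimal sketch. The abstract does not give the acceptance rule or the soft-label formula, so everything below is an assumption made for illustration: the agreement test (VLM top class equals the annotated class), the mixing weight `alpha`, and the helper names `one_hot`, `soft_label`, and `partition` are all hypothetical.

```python
from typing import List, Sequence, Tuple

NUM_CLASSES = 4  # four engagement levels (0..3); assumed here for illustration

Sample = Tuple[str, int, Sequence[float]]  # (sample id, annotated label, VLM class scores)


def one_hot(label: int, num_classes: int = NUM_CLASSES) -> List[float]:
    """Return a hard (one-hot) label vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def soft_label(annotated: int, vlm_probs: Sequence[float], alpha: float = 0.7) -> List[float]:
    """Blend the annotated label with the VLM-derived distribution.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    hard = one_hot(annotated, len(vlm_probs))
    mixed = [alpha * h + (1.0 - alpha) * p for h, p in zip(hard, vlm_probs)]
    total = sum(mixed)
    return [m / total for m in mixed]  # renormalize so the target sums to 1


def partition(samples: Sequence[Sample]):
    """Split samples into Accepted (hard labels) and Rejected (soft labels).

    Assumed rule: a sample is accepted when the VLM's top class matches
    the human annotation; otherwise it is treated as ambiguous.
    """
    accepted, rejected = [], []
    for sample_id, annotated, vlm_probs in samples:
        vlm_label = max(range(len(vlm_probs)), key=lambda i: vlm_probs[i])
        if vlm_label == annotated:
            accepted.append((sample_id, one_hot(annotated, len(vlm_probs))))
        else:
            rejected.append((sample_id, soft_label(annotated, vlm_probs)))
    return accepted, rejected


samples = [
    ("a", 2, [0.1, 0.1, 0.7, 0.1]),  # VLM agrees with the annotation
    ("b", 1, [0.0, 0.2, 0.2, 0.6]),  # VLM disagrees: ambiguous sample
]
accepted, rejected = partition(samples)
```

In this toy input, sample `a` keeps its hard label, while sample `b` receives a target that still favors the annotated class but assigns some mass to the class the VLM preferred.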

Paper Structure

This paper contains 33 sections, 6 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Dataset refinement through VLM analysis. Visualization of noisy vs. cleaned labels, illustrating the effect of VLM filtering on dataset quality. Markers represent the dataset's classes, which correspond to the four engagement levels used throughout our experiments ($0$: Not Engaged, $1$: Barely Engaged, $2$: Engaged, $3$: Highly Engaged). The plot is illustrative only and is not taken from a specific dataset or VLM configuration.
  • Figure 2: Overview of our method for engagement analysis with subjective labels. A structured questionnaire guides a VLM to assess label reliability, generating informed outputs. The method then applies two-stage curriculum learning, first training on high-confidence data and then introducing ambiguous samples with soft labels. This mitigates label noise and enhances model performance.
  • Figure 3: Two-stage curriculum learning scheme for engagement estimation: Stage $1$ uses high-confidence samples; Stage $2$ incorporates ambiguous samples.
  • Figure 4: Confusion matrices on the EngageNet validation set using Action Units + Head Pose features. (a) Baseline TCCT-Net; (b) TCCT-Net augmented with VLM-guided curriculum learning and soft-label refinement.
  • Figure 5: Confusion matrices on the EngageNet validation set using Action Units + Eye Gaze features. (a) Baseline TCCT-Net; (b) TCCT-Net augmented with VLM-guided curriculum learning and soft-label refinement.
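
The two-stage curriculum described in the captions for Figures 2 and 3 can be sketched as a per-epoch choice of training pool. This is only a sketch under stated assumptions: the linear ramp used to introduce ambiguous samples in Stage 2, the epoch counts, and the function name `curriculum_pool` are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def curriculum_pool(
    epoch: int,
    stage1_epochs: int,
    total_epochs: int,
    accepted: Sequence[str],
    rejected: Sequence[str],
) -> List[str]:
    """Return the sample ids to train on at a given (0-indexed) epoch.

    Stage 1: high-confidence (accepted) samples only.
    Stage 2: accepted samples plus a growing share of ambiguous (rejected)
    samples. The linear ramp is an assumption made for this sketch.
    """
    if epoch < stage1_epochs:
        return list(accepted)
    stage2_epochs = max(total_epochs - stage1_epochs, 1)
    progress = min((epoch - stage1_epochs + 1) / stage2_epochs, 1.0)
    n_rejected = math.ceil(progress * len(rejected))
    return list(accepted) + list(rejected[:n_rejected])


accepted_ids = ["a1", "a2"]
rejected_ids = ["r1", "r2", "r3", "r4"]

# Hypothetical schedule: 2 epochs of Stage 1, then 2 epochs of Stage 2.
schedule = [
    curriculum_pool(e, stage1_epochs=2, total_epochs=4,
                    accepted=accepted_ids, rejected=rejected_ids)
    for e in range(4)
]
```

With this schedule, the first two epochs see only the accepted samples, the third adds half of the rejected samples, and the final epoch trains on everything. In practice the rejected samples would be ordered by some reliability score before slicing; that ordering is left out here.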