From Detection to Anticipation: Online Understanding of Struggles across Various Tasks and Activities

Shijia Feng, Michael Wray, Walterio Mayol-Cuevas

TL;DR

This work tackles real-time understanding of human struggles across diverse tasks by reframing struggle localization as online detection and introducing struggle anticipation within a causal, horizon-based framework. It adapts two transformer-based baselines (LSTR and CMeRT) to operate on partial observations and evaluates them on the EvoStruggle dataset, demonstrating strong performance in both detection and up to 2-second-ahead anticipation. The study conducts thorough generalization analyses across activities and tasks, examines the impact of skill evolution, and presents a runtime assessment showing near real-time operation (~20 FPS). Overall, the findings indicate that struggle patterns generalize beyond individual tasks and that proactive, real-time assistance could be feasible for adaptive robotics and training applications.

Abstract

Understanding human skill performance is essential for intelligent assistive systems, with struggle recognition offering a natural cue for identifying user difficulties. While prior work focuses on offline struggle classification and localization, real-time applications require models capable of detecting and anticipating struggle online. We reformulate struggle localization as an online detection task and further extend it to anticipation, predicting struggle moments before they occur. We adapt two off-the-shelf models as baselines for online struggle detection and anticipation. Online struggle detection achieves 70-80% per-frame mAP, while struggle anticipation up to 2 seconds ahead yields comparable performance with slight drops. We further examine generalization across tasks and activities and analyse the impact of skill evolution. Despite larger domain gaps in activity-level generalization, models still outperform random baselines by 4-20%. Our feature-based models run at up to 143 FPS, and the whole pipeline, including feature extraction, operates at around 20 FPS, sufficient for real-time assistive applications.
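The step from online detection to anticipation can be illustrated with a small sketch. It assumes binary per-frame struggle labels and a fixed frame rate, neither of which is specified in this summary; the function names `anticipation_targets` and `per_frame_ap` are hypothetical and are not taken from the paper or from the LSTR or CMeRT code. For a horizon of h seconds, the model's output at frame t is scored against the label at frame t + h, and frames whose future label falls outside the video are excluded. The score computed here is a single-class average precision; a per-frame mAP would average such values over classes.

```python
import numpy as np


def anticipation_targets(labels, horizon_s, fps):
    """Pair each frame t with the label at t + horizon (hypothetical helper).

    Returns (targets, valid): valid marks frames whose future label exists.
    """
    labels = np.asarray(labels)
    shift = int(round(horizon_s * fps))  # horizon expressed in frames
    if shift == 0:
        # A zero horizon reduces to plain online detection.
        return labels.copy(), np.ones(len(labels), dtype=bool)
    targets = np.zeros_like(labels)
    valid = np.zeros(len(labels), dtype=bool)
    targets[:-shift] = labels[shift:]
    valid[:-shift] = True
    return targets, valid


def per_frame_ap(scores, targets):
    """Average precision over frames for one class (assumed metric form)."""
    scores = np.asarray(scores, dtype=float)
    targets = np.asarray(targets, dtype=int)
    order = np.argsort(-scores, kind="stable")  # rank frames by confidence
    hits = targets[order]
    if hits.sum() == 0:
        return float("nan")  # AP is undefined without positive frames
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())


# Toy example: 6 frames at an assumed 1 FPS, struggle at frames 2 and 3.
labels = [0, 0, 1, 1, 0, 0]
targets, valid = anticipation_targets(labels, horizon_s=2, fps=1)
scores = np.array([0.9, 0.8, 0.1, 0.2, 0.5, 0.5])  # made-up model outputs
ap = per_frame_ap(scores[valid], targets[valid])
```

Because the targets are only shifted and the inputs stay the same, the model never sees frames beyond t, which is what keeps the setup causal.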

Paper Structure

This paper contains 20 sections, 1 equation, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Online struggle detection and anticipation problem illustration.
  • Figure 2: The left plot shows the effect of anticipation time length using LSTR (Xu et al., 2021), while the right shows the impact of anticipation interval with CMeRT (Pang et al., CVPR 2025).
  • Figure 3: Online struggle detection and anticipation generalization with/without Activity Knowledge.
  • Figure 4: Heatmaps showing cross-activity generalization evaluation for online struggle detection (left) and struggle anticipation (right) in a zero-shot setting.
  • Figure 5: Online struggle detection and anticipation generalization with/without Task Knowledge.
  • ...and 4 more figures