Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations

Sadegh Rahmaniboldaji, Filip Rybansky, Quoc C. Vuong, Anya C. Hurlbert, Frank Guerin, Andrew Gilbert

TL;DR

A large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), defined as the smallest spatial or spatiotemporal regions sufficient for reliable human recognition, shows that human performance exhibits sharp declines when transitioning from MIRCs to subMIRCs.

Abstract

Humans consistently outperform state-of-the-art AI models in action recognition, particularly in challenging real-world conditions involving low resolution, occlusion, and visual clutter. Understanding the sources of this performance gap is essential for developing more robust and human-aligned models. In this paper, we present a large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), defined as the smallest spatial or spatiotemporal regions sufficient for reliable human recognition. We use our previously introduced Epic ReduAct dataset, a systematically spatially reduced and temporally scrambled dataset derived from 36 EPIC KITCHENS videos, spanning multiple spatial reduction levels and temporal conditions. Recognition performance is evaluated with over 3,000 human participants and the Side4Video model. Our analysis combines two quantitative metrics, Average Reduction Rate and Recognition Gap, with qualitative analyses of spatial factors (high-, mid-, and low-level visual features) and spatiotemporal factors, including a categorisation of actions into Low Temporal Actions (LTA) and High Temporal Actions (HTA). Results show that human performance declines sharply when transitioning from MIRCs to subMIRCs, reflecting a strong reliance on sparse, semantically critical cues such as hand-object interactions. In contrast, the model degrades more gradually and often relies on contextual and mid- to low-level features, sometimes even exhibiting increased confidence under spatial reduction. Temporally, humans remain robust to scrambling when key spatial cues are preserved, whereas the model is often insensitive to temporal disruption, revealing class-dependent temporal sensitivities.

Paper Structure (30 sections, 6 equations, 19 figures, 6 tables)

Figures (19)

  • Figure 1: Research pipeline highlighting different processes, including data preparation and classification by human and AI classifiers and the final analysis stage.
  • Figure 2: Illustration of the reduction process. The ground-truth action label for the video is “close”. Panel A undergoes three successive levels of spatial reduction, resulting in Quadrant B, which is correctly recognised by 65% of participants. Quadrant B serves as the parent quadrant for four spatially reduced child quadrants, Upper-Left, Bottom-Left, Upper-Right, and Bottom-Right (C–F), each recognised by fewer than 50% of participants, which classifies them as sub-MIRCs and Quadrant B as the corresponding MIRC. In the spatiotemporal branch, the MIRC video is temporally scrambled, producing an additional child video (G), a spatiotemporal sub-MIRC, with a recognition rate of 40%.
  • Figure 3: The AI classifier (see Figure 1) is described in greater detail, consisting of a video feature encoder augmented by a lightweight spatiotemporal side network integrated with a frozen, pre-trained vision backbone.
  • Figure 4: The human classifier (see Figure 1) is presented in more detail, comprising two stages: classification and response cleaning.
  • Figure 5: Example segmentations illustrating the Active Hand (a), Active Object (b), and Contextual Objects (c) in a video depicting the action labelled as “hang gloves”.
  • ...and 14 more figures
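
The MIRC rule described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the `Crop` class, the `is_mirc` function, and the child recognition rates are hypothetical, and the 50% threshold is an assumption inferred from the caption (the parent is recognised by 65% of participants, each child by fewer than 50%).

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumed threshold, inferred from the Figure 2 caption.
THRESHOLD = 0.5


@dataclass
class Crop:
    """A spatially reduced video crop with its human recognition rate."""
    name: str
    recognition_rate: float  # fraction of participants who recognised the action
    children: list[Crop] = field(default_factory=list)


def is_mirc(crop: Crop, threshold: float = THRESHOLD) -> bool:
    """Return True if the crop is recognised at or above the threshold
    while every one of its child crops falls below it."""
    if crop.recognition_rate < threshold or not crop.children:
        return False
    return all(child.recognition_rate < threshold for child in crop.children)


# Quadrant B from the caption (65%); the child rates are made-up values
# that merely satisfy "fewer than 50%".
quadrant_b = Crop(
    "B",
    0.65,
    children=[
        Crop("C (Upper-Left)", 0.30),
        Crop("D (Bottom-Left)", 0.45),
        Crop("E (Upper-Right)", 0.20),
        Crop("F (Bottom-Right)", 0.10),
    ],
)

print(is_mirc(quadrant_b))
```

Under the same assumed threshold, the temporally scrambled child (G), with a recognition rate of 40%, would count as a spatiotemporal sub-MIRC by the same comparison.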