Exploring Ordinal Bias in Action Recognition for Instructional Videos

Joochan Kim, Minjoon Jung, Byoung-Tak Zhang

TL;DR

This work identifies ordinal bias in instructional video action recognition, where models rely on dataset-specific action orders rather than visual understanding. It introduces Action Masking and Sequence Shuffling to stress-test robustness to nonstandard action sequences, revealing significant performance drops across multiple models and datasets. The findings indicate that bias persists even with additional training and that evaluation frameworks must evolve to assess true video comprehension. The paper suggests directions for robust modeling, balanced dataset design, and automatic detection of ordinal biases to improve real-world generalization.

Abstract

Action recognition models have achieved promising results in understanding instructional videos. However, they often rely on dominant, dataset-specific action sequences rather than true video comprehension, a problem that we define as ordinal bias. To address this issue, we propose two effective video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. Through comprehensive experiments, we demonstrate that current models exhibit significant performance drops when confronted with nonstandard action sequences, underscoring their vulnerability to ordinal bias. Our findings emphasize the importance of rethinking evaluation strategies and developing models capable of generalizing beyond fixed action patterns in diverse instructional videos.
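The two manipulations named in the abstract can be sketched on a toy video represented as parallel lists of frames and per-frame action labels. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`to_segments`, `shuffle_sequence`, `mask_action`), the `"background"` label string, and the use of `None` as a blank frame are all hypothetical, and the choice of which action to mask is left to the caller rather than derived from co-occurrence statistics.

```python
import random
from itertools import groupby


def to_segments(labels):
    """Collapse per-frame labels into (label, start, end) runs; end is exclusive."""
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


def shuffle_sequence(frames, labels, seed=0):
    """Sequence Shuffling (sketch): permute whole action segments.

    Frames inside a segment keep their internal order and their labels;
    only the order of the segments changes.
    """
    rng = random.Random(seed)
    segments = to_segments(labels)
    rng.shuffle(segments)
    new_frames, new_labels = [], []
    for _label, start, end in segments:
        new_frames.extend(frames[start:end])
        new_labels.extend(labels[start:end])
    return new_frames, new_labels


def mask_action(frames, labels, target, blank=None, background="background"):
    """Action Masking (sketch): blank every frame of `target` and relabel it.

    `blank` and `background` are assumed placeholders for a masked frame
    and the background class.
    """
    new_frames = [blank if lab == target else f for f, lab in zip(frames, labels)]
    new_labels = [background if lab == target else lab for lab in labels]
    return new_frames, new_labels


if __name__ == "__main__":
    labels = ["take"] * 3 + ["open"] * 2 + ["background"] * 2 + ["pour"] * 3
    frames = list(range(len(labels)))  # stand-ins for real frame tensors
    print(shuffle_sequence(frames, labels, seed=1))
    print(mask_action(frames, labels, target="open"))
```

A model that recognizes actions from visual content should label the shuffled frames as before, whereas a model leaning on a memorized action order would be expected to degrade.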

Paper Structure

This paper contains 24 sections, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Illustration of the ordinal bias. Due to the dominant action pair 'Take-Background', the model fails to predict the action 'Open.'
  • Figure 2: Long-tailed distributions of action pairs in datasets. Each dataset's histogram shows the frequency of action pairs, ranked by occurrence count. The red-highlighted action pairs account for 30% of all actions in the dataset, despite being a small fraction of the total pairs. Only the top 50 action pairs are shown in this visualization.
  • Figure 3: Manipulation techniques. Each video contains 943 frames. A single image represents a run of consecutive frames, with the frame range displayed in the top-left corner and the action label in the bottom-left corner. In sequence shuffling, frames are moved as contiguous sequences, e.g., frames 12 to 70 move to positions 0 to 58, and frames 695 to 698 move to positions 848 to 852. In action masking, frames 12 to 70 are masked and labeled as background.
  • Figure 4: Heatmap of action-pair frequencies in the GTEA dataset. We use the initial action 'close' for action masking.
  • Figure 5: Results of models tested on the original and manipulated sets. Consistent performance drops across different datasets and models suggest that current models rely on the ordinal bias in the datasets.
  • ...and 10 more figures
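
The action-pair statistics described in the Figure 2 and Figure 4 captions can be approximated with a simple count over consecutive action labels. The sketch below is an assumption-laden illustration rather than the paper's analysis code: the function names are hypothetical, each video is assumed to be a list of segment-level action labels, and the share is measured over pair occurrences, which may differ from how the paper computes its 30% figure.

```python
from collections import Counter


def count_action_pairs(videos):
    """Count ordered pairs of consecutive actions across all videos."""
    pairs = Counter()
    for segments in videos:
        pairs.update(zip(segments, segments[1:]))
    return pairs


def dominant_pairs(pairs, share=0.3):
    """Return the most frequent pairs that together reach `share` of all pair occurrences."""
    total = sum(pairs.values())
    covered = 0
    selected = []
    for pair, count in pairs.most_common():
        if covered >= share * total:
            break
        selected.append(pair)
        covered += count
    return selected


if __name__ == "__main__":
    # Toy annotations; real datasets would supply these from their label files.
    videos = [
        ["take", "background", "open"],
        ["take", "background", "pour"],
        ["take", "background", "close"],
        ["stir", "background"],
    ]
    pairs = count_action_pairs(videos)
    print(pairs.most_common(3))
    print(dominant_pairs(pairs, share=0.3))
```

In a long-tailed dataset, the list returned by `dominant_pairs` stays short relative to the number of distinct pairs, which is the pattern the histogram caption describes.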