Loss Knows Best: Detecting Annotation Errors in Videos via Loss Trajectories

Praditha Alwis; Soumyadeep Chandra; Deepak Ravikumar; Kaushik Roy

TL;DR

This work tackles annotation errors in temporally labeled videos, notably mislabeling and phase disordering, by introducing Cumulative Sample Loss (CSL), a loss-trajectory-based metric computed from per-frame losses across training checkpoints. The method is model-agnostic and operates post hoc without ground-truth corruption masks, flagging frames with persistently high or erratic CSL as potential errors; smoothing and sequence-level CSL further localize issues around phase transitions. Empirical evaluation on Cholec80 and EgoPER shows state-of-the-art frame-level AUC and strong segment-level detection, with the approach providing interpretable loss dynamics that pinpoint temporal and semantic inconsistencies. The framework is scalable, simple to integrate, and highlights how a model’s own learning difficulty can serve as a robust diagnostic signal for data quality in complex, temporally structured video datasets.

Abstract

High-quality video datasets are foundational for training robust models in tasks like action recognition, phase detection, and event segmentation. However, many real-world video datasets suffer from annotation errors such as *mislabeling*, where segments are assigned incorrect class labels, and *disordering*, where the temporal sequence does not follow the correct progression. These errors are particularly harmful in phase-annotated tasks, where temporal consistency is critical. We propose a novel, model-agnostic method for detecting annotation errors by analyzing the Cumulative Sample Loss (CSL), defined as the average loss a frame incurs when passed through model checkpoints saved across training epochs. This per-frame loss trajectory acts as a dynamic fingerprint of frame-level learnability. Mislabeled or disordered frames tend to show consistently high or irregular loss patterns, as they remain difficult for the model to learn throughout training, while correctly labeled frames typically converge to low loss early. To compute CSL, we train a video segmentation model and store its weights at each epoch. These checkpoints are then used to evaluate the loss of each frame in a test video. Frames with persistently high CSL are flagged as likely annotation errors, whether mislabeling or temporal misalignment. Our method does not require ground-truth labels for the annotation errors themselves and generalizes across datasets. Experiments on EgoPER and Cholec80 demonstrate strong detection performance, identifying subtle inconsistencies such as mislabeling and frame disordering. The proposed approach provides a practical tool for dataset auditing and for improving training reliability in video-based machine learning.
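The CSL computation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the per-frame losses have already been evaluated under each saved checkpoint and stacked into an array of shape (checkpoints, frames). The function names, the toy loss values, the moving-average window, and the threshold `tau = 1.5` are all hypothetical choices made for the example.

```python
import numpy as np


def cumulative_sample_loss(per_checkpoint_losses):
    """Average each frame's loss over all checkpoints.

    per_checkpoint_losses: array-like of shape (E, T), where entry [e, t]
    is the loss of frame t evaluated with the checkpoint saved at epoch e.
    Returns an array of shape (T,).
    """
    losses = np.asarray(per_checkpoint_losses, dtype=float)
    return losses.mean(axis=0)


def smooth_csl(csl, window=3):
    """Moving-average smoothing along time (assumed form; window must be odd)."""
    kernel = np.ones(window) / window
    padded = np.pad(csl, window // 2, mode="edge")  # keep output length T
    return np.convolve(padded, kernel, mode="valid")


def flag_frames(csl, tau):
    """Indices of frames whose CSL exceeds the threshold tau."""
    return np.flatnonzero(csl > tau)


# Toy example: 4 checkpoints, 6 frames. Frame 3 never becomes easy to fit,
# which is the behaviour the abstract associates with annotation errors.
toy_losses = [
    [2.0, 2.1, 1.9, 2.2, 2.0, 2.1],
    [1.0, 0.9, 1.1, 2.1, 1.0, 0.9],
    [0.4, 0.5, 0.4, 2.3, 0.5, 0.4],
    [0.2, 0.1, 0.2, 2.2, 0.1, 0.2],
]
csl = cumulative_sample_loss(toy_losses)
flagged = flag_frames(csl, tau=1.5)  # hypothetical threshold
print(csl, flagged)
```

In this toy case only frame 3 is flagged, since its loss stays near 2.2 at every checkpoint while the other frames converge. Smoothing trades per-frame sharpness for temporal stability, so the threshold would normally be chosen after deciding whether to smooth.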

Paper Structure (40 sections, 8 equations, 9 figures, 6 tables, 1 algorithm)

This paper contains 40 sections, 8 equations, 9 figures, 6 tables, 1 algorithm.

Introduction
Related Work
Label Noise and Error Detection
Self-Correction and Unsupervised Approaches
Noisy Supervision and Unlearning in Video
Methodology
Problem Setup
Overview of the Two-Stage Framework
Model Architecture
Feature Extractor
Temporal Segmentation Backbone
Classifier Head
Training Objective and Checkpoint Saving
Loss Trajectories and Cumulative Sample Loss
Smoothing and Sequence-Level Signals
...and 25 more sections

Figures (9)

Figure 1: Overview of our CSL-based framework for detecting annotation errors in video datasets. Top right: Frames with correct labels exhibit low and stable sample losses (Easy), while mislabeled or ambiguous frames (Hard, Mislabel) maintain high or erratic loss trajectories. Bottom right: Sequence-level CSL reveals phase disordering: correctly labeled sequences show smooth loss trends aligned with phase boundaries, whereas corrupted sequences exhibit temporal inconsistencies and abrupt transitions in loss curves.
Figure 2: (a) Training: A ResNet-18 Feature Extractor (FE) extracts frame features, which are passed through the ViT-B/16-based LossFormer for action segmentation. (b) Inference: CSL is computed from per-frame loss over training epochs and used to detect mislabeled or disordered frames.
Figure 3: Qualitative results of error detection on EgoPER and Cholec80. The top row shows representative frames, while the bottom row shows the cumulative sample loss trajectory across time. Our method accurately identifies corrupted regions (red spikes) using a task-specific threshold $\tau$.
Figure 4: Visualization of error detection across EgoPER dataset test sequences. Green indicates correct frames; red denotes detected annotation errors. Each row represents a different method; our method (bottom row) shows precise localization of mislabeled or disordered segments.
Figure 5: Visualization of Cumulative Sample Loss (CSL) trends across training epochs. Left: Correctly labeled data exhibits low, tightly concentrated CSL values. Center: Label misannotations correspond to high CSL spikes (red boxes). Right: Disordering induces high CSL activations, especially around phase transition boundaries.
...and 4 more figures
