Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection
Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Sicong Li, Qingming Huang
TL;DR
This work tackles mistake detection in egocentric video, where erroneous actions are rare and subtly expressed. It introduces DR-MoE, a two-stage architecture. The first stage fuses features from a frozen ViViT backbone and a LoRA-tuned ViViT via a Feature Mixture-of-Experts. The second stage trains three classifiers with different long-tailed objectives (reweighted cross-entropy, AUC loss, and a label-aware loss with sharpness-aware minimization, SAM) and fuses their predictions adaptively with a Classification Mixture-of-Experts. The approach explicitly targets imbalanced data and fine-grained error cues, and demonstrates strong performance on the HoloAssist Mistake Detection benchmark, including competitive RGB-only results. Code is released to support reproducibility.
Abstract
In this report, we address the problem of determining, from egocentric video data, whether a user performs an action incorrectly. To handle the challenges posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework. In the first stage, features are extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are combined through a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to enhance calibration and generalization. Their predictions are fused using a classification-level expert module. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances. The code is available at https://github.com/boyuh/DR-MoE.
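
As a rough illustration of the two fusion stages and the reweighted objective described above, the following is a minimal sketch and not the released implementation. It assumes a softmax gate that forms a convex combination of expert outputs, and inverse-frequency class weights for the reweighted cross-entropy; the abstract does not specify either choice. All function names, gate scores, feature values, and class counts are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_fuse(expert_outputs, gate_scores):
    """Convex combination of expert vectors, weighted by softmax(gate_scores).

    Assumption: a plain softmax gate. In a trained model the gate scores
    would come from a learned gating network, which is omitted here.
    """
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]


def inverse_frequency_weights(class_counts):
    """Weight class c by N / (K * n_c), so rare classes count more.

    Assumption: inverse-frequency weighting is one common reweighting
    scheme; the report may use a different one.
    """
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * n) for n in class_counts]


def reweighted_cross_entropy(probs, label, class_weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    return -class_weights[label] * math.log(probs[label])


# Stage 1 (feature level): fuse hypothetical features from the two backbones.
frozen_feat = [0.2, 0.8, -0.1]   # stand-in for frozen ViViT features
lora_feat = [0.6, 0.4, 0.3]      # stand-in for LoRA-tuned ViViT features
fused_feat = moe_fuse([frozen_feat, lora_feat], gate_scores=[0.0, 1.0])

# Stage 2 (classification level): fuse [p_correct, p_mistake] from three
# hypothetical classifiers trained with different objectives.
classifier_probs = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]
fused_probs = moe_fuse(classifier_probs, gate_scores=[0.1, 0.5, 0.2])

# Reweighted loss on a mistake sample (label 1) under a 90/10 class split.
weights = inverse_frequency_weights([90, 10])
loss = reweighted_cross_entropy(fused_probs, label=1, class_weights=weights)
```

Because the gate weights are non-negative and sum to one, fusing probability vectors in the second stage again yields a valid probability vector, so the fused output can be thresholded or ranked directly.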
