Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection

Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Sicong Li, Qingming Huang

TL;DR

This work tackles the challenge of detecting mistakes in egocentric video, where erroneous actions are rare and subtly expressed. It introduces DR-MoE, a two-stage architecture that first fuses features from a frozen ViViT backbone and a LoRA-tuned ViViT via a Feature Mixture-of-Experts, then ensembles three classifiers trained with different long-tailed objectives (reweighted cross-entropy, AUC loss, and a label-aware loss with SAM) using a Classification Mixture-of-Experts for adaptive fusion. The approach explicitly targets imbalanced data and fine-grained error cues, and demonstrates strong performance on the HoloAssist Mistake Detection benchmark, including competitive RGB-only results. The work provides a practical pathway for robust egocentric mistake detection and offers code to support reproducibility.

Abstract

In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To handle the challenges posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework. In the first stage, features are extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are combined through a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to enhance calibration and generalization. Their predictions are fused using a classification-level expert module. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances. The code is available at https://github.com/boyuh/DR-MoE.
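As an illustration of the first of the three objectives, the sketch below shows one common form of reweighted cross-entropy, in which each sample's loss is scaled by a weight inversely proportional to its class frequency. The abstract does not state the exact weighting scheme, so the inverse-frequency rule and the function names here (`inverse_frequency_weights`, `reweighted_cross_entropy`) are assumptions made for illustration, not the authors' implementation.

```python
import math


def inverse_frequency_weights(labels, num_classes=2):
    """Assumed weighting rule: weight_c = n / (num_classes * count_c)."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    # Classes that never occur get weight 0 to avoid division by zero.
    return [n / (num_classes * c) if c > 0 else 0.0 for c in counts]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_cross_entropy(logits_batch, labels, weights):
    """Weighted mean of per-sample cross-entropy, normalized by total weight."""
    loss_sum, weight_sum = 0.0, 0.0
    for logits, y in zip(logits_batch, labels):
        prob_true = softmax(logits)[y]
        loss_sum += -weights[y] * math.log(prob_true)
        weight_sum += weights[y]
    return loss_sum / weight_sum


# Toy long-tailed batch: 9 correct actions (class 0), 1 mistake (class 1).
labels = [0] * 9 + [1]
weights = inverse_frequency_weights(labels)
logits_batch = [[0.0, 0.0] for _ in labels]  # an uninformative classifier
loss = reweighted_cross_entropy(logits_batch, labels, weights)
```

With this rule, the single mistake sample carries nine times the weight of each correct-action sample, so errors on the rare class contribute as much to the loss as errors on the frequent class combined.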

Paper Structure

This paper contains 8 sections, 6 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: An overview of our DR-MoE method. Our method consists of two main stages. The first stage performs feature extraction by combining a frozen ViViT backbone and a LoRA-tuned ViViT backbone through a Feature Mixture-of-Experts (F-MoE) module. The second stage carries out classification by integrating the outputs of three independently trained classifiers using a Classification Mixture-of-Experts (C-MoE) module to generate the final prediction.
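The two fusion steps in Figure 1 can be sketched as a softmax-gated weighted sum over expert outputs. The caption does not specify how the F-MoE and C-MoE gates are computed, so this is a minimal sketch under the assumption of a simple softmax gate; the gate logits are passed in directly as hypothetical values, whereas in a trained model they would presumably come from a learned gating network. The name `gated_fusion` is also illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(expert_outputs, gate_logits):
    """Combine expert output vectors with softmax-normalized gate weights."""
    gate = softmax(gate_logits)
    dim = len(expert_outputs[0])
    fused = [0.0] * dim
    for g, output in zip(gate, expert_outputs):
        for i, value in enumerate(output):
            fused[i] += g * value
    return fused


# Stage 1 (feature level): fuse features from two backbones.
frozen_features = [1.0, 0.0]  # hypothetical frozen-backbone features
lora_features = [0.0, 1.0]    # hypothetical LoRA-tuned-backbone features
fused_features = gated_fusion([frozen_features, lora_features], [0.0, 0.0])

# Stage 2 (classification level): fuse three classifiers' class probabilities.
classifier_probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]  # hypothetical outputs
final_probs = gated_fusion(classifier_probs, [0.5, 1.0, 2.0])
```

Because the gate weights sum to one, fusing probability vectors yields another valid probability vector, and equal gate logits reduce the fusion to a plain average of the experts.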