Uncertainty-Guided Appearance-Motion Association Network for Out-of-Distribution Action Detection

Xiang Fang, Arvind Easwaran, Blaise Genest

TL;DR

The paper tackles out-of-distribution action detection (ODAD) in untrimmed videos by designing UAAN, a dual-branch network that jointly leverages appearance and motion cues through spatial-temporal object graphs and an appearance-motion attention module. It introduces uncertainty-guided detection with an affinity-based action-background segmentation, evidential Beta-based action classification, and DIoU-aligned localization, all learned with a combined loss ${\mathcal L}_{final}$. Experiments on THUMOS14 and ActivityNet1.3 show UAAN substantially surpassing state-of-the-art methods across multiple metrics, validating the effectiveness of inter-object appearance-motion reasoning and uncertainty modeling for realistic video OOD tasks. The approach has practical implications for robust video understanding in dynamic environments like surveillance and autonomous systems.
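Two of the ingredients named above have well-known generic forms, sketched here in plain Python. This is a minimal illustration only: the summary does not give the paper's exact formulations, so the 1D (start, end) segment representation, the function names `temporal_diou` and `beta_uncertainty`, and the choice of unit Beta priors are all assumptions. The first function is the standard Distance-IoU (IoU minus the squared centre distance normalised by the squared enclosing length) applied to temporal segments; the second maps non-negative evidence to Beta parameters and a subjective-logic style uncertainty.

```python
def temporal_diou(pred, gt):
    """Distance-IoU between two 1D segments given as (start, end)."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    iou = inter / union if union > 0 else 0.0
    # Length of the smallest segment enclosing both inputs.
    enclose = max(pe, ge) - min(ps, gs)
    center_dist = abs((ps + pe) / 2 - (gs + ge) / 2)
    penalty = (center_dist / enclose) ** 2 if enclose > 0 else 0.0
    return iou - penalty


def beta_uncertainty(pos_evidence, neg_evidence):
    """Map non-negative evidence to a Beta(alpha, beta) belief (assumed unit priors)."""
    alpha = pos_evidence + 1.0
    beta = neg_evidence + 1.0
    strength = alpha + beta
    # Expected probability of the positive class, and an uncertainty mass
    # that shrinks from 1 toward 0 as total evidence grows.
    return {"prob": alpha / strength, "uncertainty": 2.0 / strength}
```

With no evidence, `beta_uncertainty(0, 0)` gives maximal uncertainty; a segment with high uncertainty is the kind of candidate an uncertainty-guided detector would flag as OOD rather than assign to an ID class.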

Abstract

Out-of-distribution (OOD) detection aims to detect and reject test samples with semantic shifts, to prevent models trained on an in-distribution (ID) dataset from producing unreliable predictions. Existing works only extract appearance features from image datasets and cannot handle dynamic multimedia scenarios that contain rich motion information. Therefore, we target a more realistic and challenging OOD detection task: OOD action detection (ODAD). Given an untrimmed video, ODAD first classifies the ID actions and recognizes the OOD actions, and then localizes both ID and OOD actions. To this end, we propose a novel Uncertainty-Guided Appearance-Motion Association Network (UAAN), which explores both appearance features and motion contexts to reason about spatial-temporal inter-object interaction for ODAD. First, we design separate appearance and motion branches to extract appearance-oriented and motion-oriented object representations, respectively. In each branch, we construct a spatial-temporal graph to reason about appearance-guided and motion-driven inter-object interaction. Then, we design an appearance-motion attention module to fuse the appearance and motion features for final action detection. Experimental results on two challenging datasets show that UAAN beats state-of-the-art methods by a significant margin, illustrating its effectiveness.

Paper Structure (10 sections, 11 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: (a) Illustration of out-of-distribution (OOD) detection that only detects static images. (b) Previous temporal action detection models only classify/localize the ID actions and cannot detect OOD actions. (c) Our target task: OOD action detection that can not only classify and localize ID actions, but also detect and localize OOD actions. For the boxes and text, we color ID actions as blue and OOD actions as red, and background is not colored. The output labels on OOD actions are only for illustration, and they are not the actual output of the model. Best viewed in color.
  • Figure 2: Our motivation. (a) Example of inter-object interaction for action detection. (b) Existing action detection works only extract frame-level motion information and fail to distinguish similar motions such as "HighJump" and "LongJump". (c) We construct an object-aware graph to reason about inter-object interaction from the appearance and motion perspectives.
  • Figure 3: Overview of the proposed model for the challenging ODAD task. We first utilize video encoders (Faster R-CNN and I3D) to extract appearance- and motion-aware object features. Then, we design separate appearance and motion branches to reason about the spatial-temporal interaction between different objects. In addition, we design an appearance-motion attention module to fully integrate the appearance and motion features for final detection. Best viewed in color.
  • Figure 4: Performance comparison for ID action classification and ODAD on THUMOS14 (left) and ActivityNet1.3 (right), where "ODAD(AUROC)" means "AUROC for ODAD" and "AC(Accuracy)" means "Classification accuracy for ID action classification". (a) is SoftMax, (b) is OpenMax [bendale2016towards], (c) is DEAR [BaoICCV2021], (d) is OpenTAL [bao2022opental], and (e) is TFE-DCN(+) [zhou2023temporal].
  • Figure 5: Analysis on hyper-parameters. Best viewed in color.