Uncertainty-Guided Appearance-Motion Association Network for Out-of-Distribution Action Detection
Xiang Fang, Arvind Easwaran, Blaise Genest
TL;DR
The paper tackles out-of-distribution action detection (ODAD) in untrimmed videos by designing UAAN, a dual-branch network that jointly leverages appearance and motion cues through spatial-temporal object graphs and an appearance-motion attention module. It introduces uncertainty-guided detection with an affinity-based action-background segmentation, evidential Beta-based action classification, and DIoU-aligned localization, all learned with a combined loss ${\mathcal L}_{final}$. Experiments on THUMOS14 and ActivityNet1.3 show UAAN substantially surpassing state-of-the-art methods across multiple metrics, validating the effectiveness of inter-object appearance-motion reasoning and uncertainty modeling for realistic video OOD tasks. The approach has practical implications for robust video understanding in dynamic environments like surveillance and autonomous systems.
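To make the "DIoU-aligned localization" term concrete, here is a minimal sketch of the standard Distance-IoU measure adapted to 1-D temporal segments. The paper's exact formulation is not given in this summary, so the function names `temporal_diou` and `diou_loss` and the `(start, end)` segment format are assumptions for illustration only.

```python
def temporal_diou(pred, gt):
    """Distance-IoU between two 1-D segments given as (start, end).

    Hypothetical helper: standard DIoU (IoU minus a normalized
    center-distance penalty), restricted to the time axis.
    """
    ps, pe = pred
    gs, ge = gt
    if pe <= ps or ge <= gs:
        raise ValueError("segments must have positive length")

    # Plain temporal IoU.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    iou = inter / union

    # Length of the smallest segment enclosing both inputs.
    enclose = max(pe, ge) - min(ps, gs)
    # Distance between the two segment centers.
    center_dist = (ps + pe) / 2.0 - (gs + ge) / 2.0

    return iou - (center_dist ** 2) / (enclose ** 2)


def diou_loss(pred, gt):
    """Loss form: 0 for a perfect match, larger for worse alignment."""
    return 1.0 - temporal_diou(pred, gt)
```

Unlike plain IoU, this quantity still varies when two segments do not overlap, because the center-distance penalty keeps changing with how far apart they are; this is the usual motivation for DIoU-style localization losses.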
Abstract
Out-of-distribution (OOD) detection aims to detect and reject test samples with semantic shifts, preventing models trained on an in-distribution (ID) dataset from producing unreliable predictions. Existing works extract only appearance features from image datasets and cannot handle dynamic multimedia scenarios that are rich in motion information. Therefore, we target a more realistic and challenging OOD detection task: OOD action detection (ODAD). Given an untrimmed video, ODAD first classifies the ID actions and recognizes the OOD actions, and then localizes both ID and OOD actions. To this end, we propose a novel Uncertainty-Guided Appearance-Motion Association Network (UAAN), which explores both appearance features and motion contexts to reason about spatial-temporal inter-object interaction for ODAD. First, we design separate appearance and motion branches to extract appearance-oriented and motion-oriented object representations. In each branch, we construct a spatial-temporal graph to reason about appearance-guided and motion-driven inter-object interaction. Then, we design an appearance-motion attention module to fuse the appearance and motion features for final action detection. Experimental results on two challenging datasets show that UAAN outperforms state-of-the-art methods by a significant margin, demonstrating its effectiveness.
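The fusion step can be illustrated with a minimal sketch, assuming the appearance-motion attention module behaves like bidirectional scaled dot-product cross-attention over per-object features. The abstract does not specify the module's internals, so the functions `cross_attend` and `fuse`, the residual connections, and the final concatenation are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attend(queries, keys_values):
    """Each query vector attends over keys_values (used as keys and values)."""
    d = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product attention weights for this query.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys_values])
        # Weighted sum of the value vectors.
        attended.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)]
        )
    return attended


def fuse(appearance, motion):
    """Fuse per-object appearance and motion features (assumed design).

    Both inputs are lists of equal length holding d-dimensional vectors,
    one per object. Each branch attends to the other, a residual is added,
    and the two enriched vectors are concatenated per object.
    """
    if len(appearance) != len(motion):
        raise ValueError("both branches must describe the same objects")
    a2m = cross_attend(appearance, motion)  # appearance queries motion
    m2a = cross_attend(motion, appearance)  # motion queries appearance
    fused = []
    for a, am, m, ma in zip(appearance, a2m, motion, m2a):
        enriched_a = [x + y for x, y in zip(a, am)]
        enriched_m = [x + y for x, y in zip(m, ma)]
        fused.append(enriched_a + enriched_m)
    return fused
```

For two objects with 2-dimensional features, `fuse` returns two 4-dimensional vectors; a downstream detection head would consume these fused representations.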
