Cross-Modal Synergies: Unveiling the Potential of Motion-Aware Fusion Networks in Handling Dynamic and Static ReID Scenarios
Fuxi Ling, Hongye Liu, Guoqiang Huang, Jing Li, Hong Wu, Zhihao Tang
TL;DR
This paper addresses person re-identification (ReID) under occlusion by introducing MOTAR-FUSE, a Motion-Aware Fusion network that derives motion cues from static images to improve both image- and video-based ReID. The architecture couples a visual encoder with a visual adapter, a motion-aware transformer, and a fusion encoder, and produces a unified representation through a hybrid class token. It is trained with a motion-consistency objective alongside standard cross-entropy and triplet losses. Key contributions include the motion consistency task, an analysis of learnable query length, and results on holistic, occluded, and video datasets that show robustness to real-world occlusion and dynamic scenarios. By bridging static-image analysis with motion-aware dynamics, the approach targets practical ReID systems for urban surveillance, achieves state-of-the-art or competitive performance on multiple benchmarks, and offers insights into pre-training and part-based feature integration.
Abstract
Person re-identification (ReID) across varied surveillance scenarios is difficult, particularly when occlusions occur. We introduce the Motion-Aware Fusion (MOTAR-FUSE) network, which uses motion cues derived from static imagery to improve ReID. The network incorporates a dual-input visual adapter that processes both images and videos, enabling more effective feature extraction. It also integrates a motion consistency task, which trains the motion-aware transformer to capture the dynamics of human motion. This improves feature recognition in scenarios where occlusions are prevalent. Evaluations across multiple ReID benchmarks, covering holistic, occluded, and video-based scenarios, show that MOTAR-FUSE outperforms existing approaches.
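
The TL;DR states that training combines a motion-consistency objective with standard cross-entropy and triplet losses. The following is a minimal sketch of how such a combined objective could be assembled, not the authors' implementation. The function names, the triplet margin of 0.3, and the weighting factor `motion_weight` are assumptions made for illustration. The motion-consistency term itself is not specified in this section, so it is passed in as an already-computed scalar.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample over identity classes."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Margin-based triplet loss; the margin value is an assumed default."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def total_loss(logits, label, anchor, positive, negative,
               motion_term, motion_weight=1.0):
    """Sum of the three terms named in the TL;DR.

    motion_term is a placeholder for the motion-consistency loss, whose
    exact form is not given here; motion_weight is a hypothetical
    balancing coefficient.
    """
    return (cross_entropy(logits, label)
            + triplet_loss(anchor, positive, negative)
            + motion_weight * motion_term)
```

In this sketch the three terms are simply added, with one scalar weight on the motion term. Whether the paper weights the terms differently is not stated in this section.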
