RGB-D Tracking via Hierarchical Modality Aggregation and Distribution Network
Boyue Xu, Yi Xu, Ruichao Hou, Jia Bei, Tongwei Ren, Gangshan Wu
TL;DR
This work addresses the need for real-time, robust RGB-D tracking by introducing HMAD, a Hierarchical Modality Aggregation and Distribution network built on a DiMP baseline. HMAD uses a two-stage architecture: CBAM-based shallow feature extraction, followed by a hierarchical distribution and fusion module that combines RGB texture and depth semantics from multiple feature levels. Ablation studies and experiments on DepthTrack and RGBD1K show state-of-the-art accuracy with real-time performance on edge devices (around 15 FPS), and real-world tests confirm robustness to occlusion, similar-target interference, and dim lighting. The approach is relevant to robotics and HCI, where reliable tracking must run within the constraints of resource-limited platforms.
Abstract
The integration of dual-modal features has been pivotal in advancing RGB-Depth (RGB-D) tracking. However, current trackers are inefficient and rely solely on single-level features, which weakens the robustness of fusion and yields speeds too slow for real-world applications. In this paper, we introduce a novel network, HMAD (Hierarchical Modality Aggregation and Distribution), that addresses these challenges. HMAD leverages the distinct feature representation strengths of the RGB and depth modalities and distributes and fuses features hierarchically, thereby enhancing the robustness of RGB-D tracking. Experimental results on various RGB-D datasets demonstrate that HMAD achieves state-of-the-art performance. Moreover, real-world experiments further validate HMAD's capacity to handle a range of tracking challenges in real-time scenarios.
