Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos

Jianbo Ma, Hui Luo, Qi Chen, Yuankai Qi, Yumei Sun, Amin Beheshti, Jianlin Zhang, Ming-Hsuan Yang

TL;DR

AMOT addresses robust multi-object tracking in UAV-captured videos by jointly modeling appearance and motion through an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module. AMC uses dense, appearance-guided response maps to compute bi-directional spatial distances for reliable detection-to-track affinity, while MTC reactivates unmatched tracks by reconciling appearance-guided predictions with Kalman-based motion predictions. Built on a JDE backbone, AMOT achieves state-of-the-art IDF1 and MOTA on the VisDrone2019, UAVDT, and VT-MOT-UAV benchmarks at real-time speed, and ablation studies confirm the additive benefits of AMC and MTC. The approach is plug-and-play and training-free, so it can be integrated into existing trackers, which underscores its practical value for UAV-based surveillance and tracking tasks.

Abstract

Multi-object tracking (MOT) aims to track multiple objects while maintaining consistent identities across frames of a given video. In unmanned aerial vehicle (UAV) recorded videos, frequent viewpoint changes and complex UAV-ground relative motion dynamics pose significant challenges, which often lead to unstable affinity measurement and ambiguous association. Existing methods typically model motion and appearance cues separately, overlooking their spatio-temporal interplay and resulting in suboptimal tracking performance. In this work, we propose AMOT, which jointly exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module. Specifically, the AMC matrix computes bi-directional spatial consistency under the guidance of appearance features, enabling more reliable and context-aware identity association. The MTC module complements AMC by reactivating unmatched tracks through appearance-guided predictions that align with Kalman-based predictions, thereby reducing broken trajectories caused by missed detections. Extensive experiments on three UAV benchmarks, including VisDrone2019, UAVDT, and VT-MOT-UAV, demonstrate that our AMOT outperforms current state-of-the-art methods and generalizes well in a plug-and-play and training-free manner.
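The abstract describes MTC as reactivating an unmatched track when its appearance-guided prediction aligns with the Kalman-based prediction. A minimal sketch of such an agreement test is shown below. The abstract does not state the exact rule, so the Euclidean center distance and the `max_gap` threshold are assumptions made for illustration, and `should_continue_track` is a hypothetical name.

```python
import math


def should_continue_track(appearance_center, kalman_center, max_gap):
    """Return True when the two predicted centers lie within max_gap pixels.

    appearance_center: (x, y) predicted from appearance cues (assumed input).
    kalman_center:     (x, y) predicted by the Kalman filter (assumed input).
    max_gap:           hypothetical agreement threshold in pixels.
    """
    gap = math.dist(appearance_center, kalman_center)
    return gap <= max_gap


# Illustrative values only: the first pair of centers is 5 px apart, the second about 60 px.
agree = should_continue_track((100.0, 50.0), (103.0, 54.0), max_gap=10.0)
disagree = should_continue_track((100.0, 50.0), (160.0, 54.0), max_gap=10.0)
print(agree, disagree)  # True False
```

Under this reading, a track left unmatched after both association stages would be kept alive only when the two independent predictions corroborate each other.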

Paper Structure

This paper contains 35 sections, 15 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: IDF1-MOTA-FPS comparisons of different methods on VisDrone2019. The radius of the circle denotes FPS. Our AMOT achieves the highest IDF1 of 61.4% and MOTA of 46.0%, with a real-time inference speed of 36.4 FPS.
  • Figure 2: Tracking pipeline of AMOT. Specifically, we introduce the appearance-motion consistency (AMC) matrix $\mathbf{C}_{AMC}$ and integrate it with the appearance similarity matrix $\mathbf{C}_{App}$ and the Intersection-over-Union (IOU) matrix $\mathbf{C}_{IOU}$ to robustly associate high-confidence detections $\mathcal{D}_{h}$ with tracks $\mathcal{T}$ at frame $t-1$. Then, the unmatched tracks $\mathcal{T}_{un}$ are associated with low-confidence detections $\mathcal{D}_{l}$ in the second stage. The remaining unmatched tracks $\mathcal{T}_{second\_un}$ may then be reactivated by our proposed motion-aware track continuation (MTC) module. $\textit{KF}$ denotes the Kalman filter.
  • Figure 3: Overview of bi-directional spatial distances in AMC matrix. $\mathbf{A}_{trk}^{(2)}$ and $\mathbf{A}_{det}^{(1)}$ are the track-specific dense response map of T#2 and the detection-specific dense response map of D#1, respectively. $\mathbf{D}_{f}(2,1)$ represents the forward spatial distance from predicted center $\mathcal{Q}_{trk}^{(2)}$ to observed center $\mathcal{O}_{det}^{(1)}$, while $\mathbf{D}_{b}(1,2)$ denotes the backward spatial distance from predicted center $\mathcal{Q}_{det}^{(1)}$ to observed center $\mathcal{O}_{trk}^{(2)}$.
  • Figure 4: Visualization of tracking results with the MTC module. Despite missed detections, MTC effectively propagates tracks and maintains their correct identities.
  • Figure 5: Association performance comparison under different frame intervals for the car category. Object displacement increases with frame interval.
  • ...and 4 more figures
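
The bi-directional spatial distances described in the Figure 3 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the response map is taken to be a cosine similarity between one embedding and every cell of a dense feature map, the predicted center is taken to be the strongest response, the track is searched in the current frame while the detection is searched in the previous frame, and centers are expressed in grid-cell coordinates. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def response_map(embedding, feature_map):
    """Dense response map: similarity of one embedding to each cell of feature_map[y][x]."""
    return [[cosine(embedding, cell) for cell in row] for row in feature_map]


def predicted_center(resp):
    """(x, y) of the strongest response; the first maximum wins on ties."""
    cells = [(x, y) for y, row in enumerate(resp) for x in range(len(row))]
    return max(cells, key=lambda c: resp[c[1]][c[0]])


def bidirectional_distance(trk_emb, trk_center, det_emb, det_center,
                           feat_prev, feat_curr):
    """Return (forward, backward) spatial distances for one track-detection pair."""
    # Forward: locate the track in the current frame, compare with the detection's center.
    q_trk = predicted_center(response_map(trk_emb, feat_curr))
    # Backward: locate the detection in the previous frame, compare with the track's center.
    q_det = predicted_center(response_map(det_emb, feat_prev))
    return math.dist(q_trk, det_center), math.dist(q_det, trk_center)


# Toy 2 x 3 feature maps with 2-D embeddings: the object moves from cell (0, 0) to (2, 1).
obj, bg = [1.0, 0.0], [0.0, 1.0]
feat_prev = [[obj, bg, bg], [bg, bg, bg]]
feat_curr = [[bg, bg, bg], [bg, bg, obj]]

matched = bidirectional_distance(obj, (0, 0), obj, (2, 1), feat_prev, feat_curr)
shifted = bidirectional_distance(obj, (0, 0), obj, (0, 1), feat_prev, feat_curr)
print(matched)  # (0.0, 0.0)
print(shifted)  # (2.0, 0.0)
```

In the toy example, a detection whose observed center coincides with the track's predicted location yields zero distance in both directions, while a detection observed two cells away yields a larger forward distance. How the two distances are combined into the AMC matrix is not specified in this excerpt, so that step is left out.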