Spatio-temporal Graph Learning on Adaptive Mined Key Frames for High-performance Multi-Object Tracking

Futian Wang, Fengxiang Liu, Xiao Wang

TL;DR

This work tackles robust multi-object tracking in crowded scenes by addressing occlusions and identity switches through spatio-temporal graph learning on adaptively mined key frames. It introduces a reinforcement-learning-driven Key Frame Extraction (KFE) module that segments videos adaptively, and an Intra-Frame Feature Fusion (IFF) module that uses a Graph Convolutional Network to exchange contextual information among nearby objects within a frame. Combined with a hierarchical integration of short-term and long-term trajectories, the method reports state-of-the-art MOT17 performance under the same detections: a HOTA of $68.6$, an IDF1 of $81.0$, an AssA of $66.6$, and $893$ identity switches (IDS). The approach improves robustness to occlusion and discrimination among similar-looking objects, offering a scalable solution for multi-object tracking in real-world video analysis.

Abstract

In the realm of multi-object tracking, the challenge of accurately capturing the spatial and temporal relationships between objects in video sequences remains a significant hurdle. This is further complicated by frequent occurrences of mutual occlusions among objects, which can lead to tracking errors and reduced performance in existing methods. Motivated by these challenges, we propose a novel adaptive key frame mining strategy that addresses the limitations of current tracking approaches. Specifically, we introduce a Key Frame Extraction (KFE) module that leverages reinforcement learning to adaptively segment videos, thereby guiding the tracker to exploit the intrinsic logic of the video content. This approach allows us to capture structured spatial relationships between different objects as well as the temporal relationships of objects across frames. To tackle the issue of object occlusions, we have developed an Intra-Frame Feature Fusion (IFF) module. Unlike traditional graph-based methods that primarily focus on inter-frame feature fusion, our IFF module uses a Graph Convolutional Network (GCN) to facilitate information exchange between the target and surrounding objects within a frame. This innovation significantly enhances target distinguishability and mitigates tracking loss and appearance similarity due to occlusions. By combining the strengths of both long and short trajectories and considering the spatial relationships between objects, our proposed tracker achieves impressive results on the MOT17 dataset, i.e., 68.6 HOTA, 81.0 IDF1, 66.6 AssA, and 893 IDS, proving its effectiveness and accuracy.
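
To make the intra-frame fusion idea concrete, the following is a minimal sketch of one graph-convolution step over the detections of a single frame. It is not the authors' implementation: the radius-based adjacency, the symmetric normalization, the feature dimension, and the random weights are all assumptions chosen for illustration, and the names `build_proximity_adjacency` and `gcn_layer` are hypothetical.

```python
import numpy as np


def build_proximity_adjacency(centers, radius):
    """Connect detections whose box centers lie within `radius` pixels.

    Assumption: spatial proximity defines the neighbors; the paper's
    actual graph construction may differ.
    """
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = (dist <= radius).astype(np.float64)
    np.fill_diagonal(adj, 1.0)  # keep self-loops so each node retains its own feature
    return adj


def gcn_layer(features, adj, weight):
    """One graph-convolution step: normalize, aggregate neighbors, project, ReLU."""
    deg = adj.sum(axis=1)  # >= 1 because of self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    norm_adj = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ features @ weight, 0.0)


# Toy frame with four detections (hypothetical box centers in pixels).
rng = np.random.default_rng(0)
centers = np.array([[100.0, 200.0], [130.0, 205.0], [165.0, 210.0], [600.0, 220.0]])
features = rng.normal(size=(4, 8))      # stand-in appearance embeddings
weight = rng.normal(size=(8, 8)) * 0.1  # stand-in for learned weights

adj = build_proximity_adjacency(centers, radius=50.0)
fused = gcn_layer(features, adj, weight)
```

In this toy frame, detections 0 and 1 are neighbors, as are 1 and 2, while detection 3 is isolated. Each fused feature therefore mixes in a different neighborhood, which is the sense in which surrounding objects can add context that helps separate similar-looking targets.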

Paper Structure (10 sections, 8 equations, 3 figures, 3 tables, 1 algorithm)

Figures (3)

  • Figure 1: Comparison between (a, b, c) existing algorithms and (d) our newly proposed MOT tracking framework.
  • Figure 2: The top row shows pedestrians cropped from the MOT17-09 sequence of the MOT17 dataset. The girl with a high ponytail reappears several times, each time with a different degree of occlusion. In the second row, the two highlighted objects look very similar and stand close to each other. To distinguish between them, we propose using contextual information from the surrounding objects.
  • Figure 3: An overview of our proposed multi-object tracking framework.