Attention-Aware Multi-View Pedestrian Tracking

Reef Alturki, Adrian Hilton, Jean-Yves Guillemaut

TL;DR

This paper tackles occlusion in multi-view pedestrian tracking by combining an early-fusion BEV representation with a cross-attention mechanism that propagates and aligns pedestrian features across frames. The approach uses an encoder–projection–decoder BEV pipeline, enhanced by a cross-attention module that operates on BEV tokens from neighboring frames and a 3D positional encoding to improve temporal associations. It introduces a robust affinity estimation for cross-frame matching and a deformable-convolution-based BEV decoder to better fuse multi-scale location and appearance cues. Evaluations on Wildtrack and MultiviewX show state-of-the-art IDF1 and competitive MOTA/MOTP, highlighting the practical impact of cross-view temporal reasoning for occlusion-heavy scenes.

Abstract

Despite recent advances in multi-object tracking, occlusion remains a significant challenge. Multi-camera setups have been used to address this challenge by providing comprehensive coverage of the scene. Recent multi-view pedestrian detection models have highlighted the potential of an early-fusion strategy, which projects the feature maps of all views onto a common ground plane, or Bird's Eye View (BEV), and then performs detection. This strategy has been shown to improve both detection and tracking performance. However, the perspective transformation results in significant distortion on the ground plane, affecting the robustness of the appearance features of the pedestrians. To tackle this limitation, we propose a novel model that incorporates attention mechanisms in a multi-view pedestrian tracking scenario. Our model uses an early-fusion strategy for detection and a cross-attention mechanism to establish robust associations between pedestrians in different frames, while efficiently propagating pedestrian features across frames, resulting in a more robust feature representation for each pedestrian. Extensive experiments demonstrate that our model outperforms state-of-the-art models, with an IDF1 score of $96.1\%$ on the Wildtrack dataset and $85.7\%$ on the MultiviewX dataset.

Paper Structure

This paper contains 17 sections, 4 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the cross-attention module. It takes BEV features from two frames and extracts the features at pedestrian locations, which are used as tokens for the cross-attention processing. The module computes affinity scores between pedestrians in different frames and propagates the features across frames. A dedicated symbol in the figure represents positional encoding, while $\otimes$ and $\odot$ denote element-wise and matrix multiplication, respectively.
  • Figure 2: Overview of our model pipeline. The feature maps are extracted from the input views, projected onto a common ground plane, and aggregated to form a unified BEV feature, which is then passed into the decoder for further processing. The BEV features from two neighboring frames are then processed by the cross-attention module to efficiently establish associations and propagate information across frames.
  • Figure 3: Comparison of tracking performance between our approach and the baseline using the full Wildtrack test set. The dashed ovals highlight the areas where our model demonstrates improved tracking accuracy compared to the baseline model.
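
The cross-attention step described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the random projection matrices `w_q`, `w_k`, `w_v` (standing in for learned weights), the 2D sinusoidal positional encoding (the paper mentions a 3D positional encoding whose form is not given here), and the scaled dot-product softmax affinity are all assumptions chosen for illustration.

```python
import numpy as np


def extract_tokens(bev, locations):
    """Gather one C-dimensional feature per pedestrian from a (C, H, W) BEV map."""
    rows, cols = zip(*locations)
    return bev[:, list(rows), list(cols)].T  # shape (N, C)


def positional_encoding(locations, dim):
    """Sinusoidal encoding of (row, col) grid positions.

    Assumed stand-in for the paper's positional encoding; dim must be
    divisible by 4.
    """
    pos = np.asarray(locations, dtype=np.float64)                  # (N, 2)
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 4) / (dim // 4)))  # (dim/4,)
    angles = pos[:, :, None] * freqs[None, None, :]                # (N, 2, dim/4)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(locations), -1)                         # (N, dim)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(tokens_t, tokens_prev, w_q, w_k, w_v):
    """Queries come from frame t; keys and values from the previous frame."""
    q = tokens_t @ w_q
    k = tokens_prev @ w_k
    v = tokens_prev @ w_v
    # Affinity between every pedestrian in frame t and every one in frame t-1.
    affinity = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N_t, N_prev)
    # Features propagated from the previous frame, weighted by affinity.
    propagated = affinity @ v                                    # (N_t, C)
    return affinity, propagated


rng = np.random.default_rng(0)
C, H, W = 16, 12, 20                      # hypothetical BEV feature size
bev_prev = rng.normal(size=(C, H, W))
bev_t = rng.normal(size=(C, H, W))
locs_prev = [(2, 3), (7, 15), (10, 4)]    # hypothetical detections, frame t-1
locs_t = [(2, 4), (8, 15)]                # hypothetical detections, frame t

tok_prev = extract_tokens(bev_prev, locs_prev) + positional_encoding(locs_prev, C)
tok_t = extract_tokens(bev_t, locs_t) + positional_encoding(locs_t, C)

# Random matrices stand in for learned projection weights.
w_q, w_k, w_v = (rng.normal(scale=C ** -0.5, size=(C, C)) for _ in range(3))
affinity, propagated = cross_attention(tok_t, tok_prev, w_q, w_k, w_v)
matches = affinity.argmax(axis=1)         # greedy association, for illustration only
```

Each row of `affinity` is a distribution over the pedestrians of the previous frame, so it can serve both as a matching score and as the weighting used to propagate features. A real tracker would typically replace the greedy `argmax` with a one-to-one assignment and handle pedestrians that enter or leave the scene.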