Lifting Multi-View Detection and Tracking to the Bird's Eye View

Torben Teepe, Philipp Wolters, Johannes Gilg, Fabian Herzog, Gerhard Rigoll

TL;DR

This work combines the pedestrian and vehicle branches of tracking research, adds cross-scene setups as a new challenge for multi-view detection, and presents an architecture that aggregates features from multiple time steps to learn robust detection and combines appearance- and motion-based cues for tracking.

Abstract

Multi-view aggregation is a promising way to tackle challenges such as occlusion and missed detections in multi-object detection and tracking. Recent advances in multi-view detection and 3D object recognition have significantly improved performance by projecting all views onto the ground plane and running detection in the Bird's Eye View. In this paper, we compare modern lifting methods, both parameter-free and parameterized, to multi-view aggregation. Additionally, we present an architecture that aggregates the features of multiple time steps to learn robust detection and combines appearance- and motion-based cues for tracking. Most current tracking approaches focus on either pedestrians or vehicles. In our work, we combine both branches and add new challenges to multi-view detection with cross-scene setups. Our method generalizes to three public datasets across two domains: (1) pedestrian: Wildtrack and MultiviewX, and (2) roadside perception: Synthehicle, achieving state-of-the-art performance in detection and tracking. Code: https://github.com/tteepe/TrackTacular
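
To make the idea of parameter-free lifting concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each camera is described by a known 3x4 projection matrix, projects ground-plane grid points into every view, bilinearly samples the image features there, and averages over the views that see each point. The function names (`project_points`, `bilinear_sample`, `lift_to_bev`) and the averaging rule are illustrative assumptions.

```python
import numpy as np

def project_points(P, pts_world):
    """Project N x 3 world points with a 3 x 4 camera matrix P.

    Returns N x 2 pixel coordinates (u, v) and the homogeneous depth.
    """
    pts_h = np.hstack([pts_world, np.ones((len(pts_world), 1))])  # N x 4
    uvw = pts_h @ P.T                                             # N x 3
    depth = uvw[:, 2]
    # Avoid dividing by zero for points on the camera plane.
    safe = np.where(np.abs(depth) > 1e-9, depth, 1e-9)
    return uvw[:, :2] / safe[:, None], depth

def bilinear_sample(feat, uv):
    """Sample a C x H x W feature map at float pixel coordinates.

    Points outside the image yield zeros and are flagged in `valid`.
    """
    _, H, W = feat.shape
    u, v = uv[:, 0], uv[:, 1]
    valid = (u >= 0) & (u <= W - 1) & (v >= 0) & (v <= H - 1)
    u = np.clip(u, 0, W - 1)
    v = np.clip(v, 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    top = feat[:, v0, u0] * (1 - du) + feat[:, v0, u1] * du
    bot = feat[:, v1, u0] * (1 - du) + feat[:, v1, u1] * du
    out = top * (1 - dv) + bot * dv                               # C x N
    return out * valid, valid

def lift_to_bev(feats, Ps, grid_xy, z=0.0):
    """Average per-view features sampled at ground-plane points (height z)."""
    pts = np.column_stack([grid_xy, np.full(len(grid_xy), z)])
    acc = np.zeros((feats[0].shape[0], len(pts)))
    cnt = np.zeros(len(pts))
    for feat, P in zip(feats, Ps):
        uv, depth = project_points(P, pts)
        sampled, valid = bilinear_sample(feat, uv)
        valid = valid & (depth > 0)       # keep points in front of the camera
        acc += sampled * valid
        cnt += valid
    return acc / np.maximum(cnt, 1)       # C x N, zeros where no view sees the point
```

Reshaping the returned `C x N` array to `C x X x Y` for an `X x Y` ground grid gives a BEV feature map that a detection head could consume. A parameterized lifting method would replace the fixed sampling rule with learned components.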

Paper Structure (13 sections, 2 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Lifting Methods. We compare three methods that lift the pixel information to 3D voxel space for detection and tracking.
  • Figure 2: Lifting Methods. The three lifting methods we compare in this paper. The bilinear sampling method (b) simplifies the depth splat approach (a) by not explicitly predicting the depth. Our method extends bilinear sampling to project image features only where they intersect in the 3D volume. Thus, our method approximates triangulation at voxel granularity.
  • Figure 3: Overview of Our Approach. The input views are encoded, and the resulting camera features are projected using one of three lifting methods. After aggregation, the feature is concatenated with the feature of the previous time step. From the decoded feature, we predict the object locations and their offsets to the locations in the previous time step. Additionally, we guide the architecture by predicting the object centers in the image features.
  • Figure 4: Qualitative Results. Detection example shown on Synthehicle. (a) shows the input images projected to the BEV space, (b) shows the ground truth heatmap of all vehicles, and (c) shows our prediction with bilinear sampling.
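
The Figure 3 caption mentions predicting, for each location, an offset to the location in the previous step. One plausible way to turn such offsets into track identities is greedy nearest-neighbour matching, sketched below. This is an illustrative assumption rather than the paper's tracker: the function `associate`, the `max_dist` threshold, and the sign convention (previous position ≈ current position + offset) are all hypothetical.

```python
import numpy as np

def associate(curr_xy, offsets, prev_xy, prev_ids, max_dist=1.0, next_id=0):
    """Greedily match offset-compensated detections to previous tracks.

    curr_xy:  N x 2 BEV positions detected in the current step.
    offsets:  N x 2 predicted displacements back to the previous step
              (assumed convention: previous ~= current + offset).
    prev_xy:  M x 2 track positions from the previous step.
    prev_ids: list of M track ids.
    Returns the id assigned to each current detection and the next free id.
    """
    curr_xy = np.asarray(curr_xy, dtype=float).reshape(-1, 2)
    offsets = np.asarray(offsets, dtype=float).reshape(-1, 2)
    prev_xy = np.asarray(prev_xy, dtype=float).reshape(-1, 2)

    back = curr_xy + offsets          # where each detection should have been
    ids = [-1] * len(curr_xy)

    if len(curr_xy) and len(prev_xy):
        # Pairwise distances between back-projected detections and tracks.
        dist = np.linalg.norm(back[:, None, :] - prev_xy[None, :, :], axis=2)
        used_prev = set()
        # Visit candidate pairs from closest to farthest.
        for flat in np.argsort(dist, axis=None):
            i, j = np.unravel_index(flat, dist.shape)
            if dist[i, j] > max_dist:
                break                 # all remaining pairs are farther away
            if ids[i] == -1 and j not in used_prev:
                ids[i] = prev_ids[j]
                used_prev.add(j)

    # Unmatched detections start new tracks.
    for i in range(len(ids)):
        if ids[i] == -1:
            ids[i] = next_id
            next_id += 1
    return ids, next_id
```

Appearance cues, which the abstract says are combined with motion cues, could enter such a scheme as an additional term in the pairwise cost. That extension is left out of this sketch.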