Learnable Online Graph Representations for 3D Multi-Object Tracking

Jan-Nico Zaech, Dengxin Dai, Alexander Liniger, Martin Danelljan, Luc Van Gool

TL;DR

This work presents a learnable online 3D MOT framework that unifies detections and track dynamics into a single graph and solves data association through Neural Message Passing (NMP). By embedding detection/track states and leveraging time-aware, relational messages, the method jointly handles data association, track initiation, and termination, replacing hand-crafted heuristics. A two-stage, semi-online training regime plus data augmentation aligns training and inference distributions, yielding state-of-the-art AMOTA of $0.656$ on nuScenes with substantially fewer ID-switches. The approach demonstrates robust generalization across detectors and offers a scalable platform for further integrating learnable track state representations in real-time autonomous systems.

Abstract

Tracking of objects in 3D is a fundamental task in computer vision that finds use in a wide range of applications such as autonomous driving, robotics, and augmented reality. Most recent approaches for 3D multi-object tracking (MOT) from LIDAR use object dynamics together with a set of handcrafted features to match detections of objects. However, manually designing such features and heuristics is cumbersome and often leads to suboptimal performance. In this work, we instead strive towards a unified and learning-based approach to the 3D MOT problem. We design a graph structure to jointly process detection and track states in an online manner. To this end, we employ a fully trainable Neural Message Passing network for data association. Our approach provides a natural way to initialize tracks and handle false-positive detections, while significantly improving track stability. We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.

Paper Structure

This paper contains 38 sections, 17 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: The proposed method uses a graph representation for detections and tracks. An architecture based on neural message passing matches detections to tracks and provides a learning-based framework for track initialization, effectively replacing the heuristics required by current approaches.
  • Figure 2: The proposed tracking graph combines tracks, each represented by a sequence of track nodes, and detections in a single representation. During the NMP iterations, information is exchanged between nodes and edges and is thus distributed globally throughout the graph.
  • Figure 3: Visualization of different update scenarios, showing only the active edges in the graph. The graph represents a single track and two detections at each timestep. a) Shows the ideal case, where the track is matched to one detection node at every timestep and all detection nodes are connected to one another. b) Represents the case where the match at one timestep is dropped and the track is matched to only two detection nodes. c) Shows a situation in which the proposed approach can select the globally best solution, even though two detection nodes have been matched to the track in the last frame.
  • Figure 4: Qualitative results generated with our approach and projected into the $360^\circ$ images.
  • Figure 5: Qualitative results generated with our approach projected into the top-down view containing LIDAR points.
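
The caption of Figure 2 states that NMP iterations exchange information between nodes and edges, so that it spreads globally through the graph. The following is a minimal sketch of that propagation idea only, not the authors' architecture: the names `TrackingGraph`, `message_passing_step`, and `chain_graph` are hypothetical, the features are toy one-dimensional values, and the learned update networks are replaced by a plain element-wise mean as a stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class TrackingGraph:
    """Toy graph: node id -> feature vector, (node id, node id) -> edge feature."""
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)


def mean(vectors):
    """Element-wise mean of equally sized vectors (stand-in for a learned update)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def message_passing_step(graph):
    """One iteration: update each edge from its endpoints, then each node
    from its incident (already updated) edges."""
    new_edges = {
        (u, v): mean([feat, graph.nodes[u], graph.nodes[v]])
        for (u, v), feat in graph.edges.items()
    }
    new_nodes = {}
    for node_id, feat in graph.nodes.items():
        incident = [e for (u, v), e in new_edges.items() if node_id in (u, v)]
        new_nodes[node_id] = mean([feat] + incident) if incident else list(feat)
    return TrackingGraph(new_nodes, new_edges)


def chain_graph(d1_value):
    """Hypothetical example: track node t0 -- detection d0 -- detection d1.
    There is no direct edge between t0 and d1."""
    return TrackingGraph(
        nodes={"t0": [0.0], "d0": [0.0], "d1": [d1_value]},
        edges={("t0", "d0"): [0.0], ("d0", "d1"): [0.0]},
    )


if __name__ == "__main__":
    graph = chain_graph(1.0)
    for step in range(2):
        graph = message_passing_step(graph)
        print(step + 1, graph.nodes["t0"])
```

In this toy chain, the track node `t0` has no edge to `d1`, so the feature of `d1` can only influence `t0` after a second iteration. This illustrates why several iterations are needed before information is distributed beyond direct neighbours.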