Transformer-Based Multi-Object Smoothing with Decoupled Data Association and Smoothing

Juliano Pinto; Georg Hess; Yuxuan Xia; Henk Wymeersch; Lennart Svensson

TL;DR

This work introduces D3AS, a transformer-based framework for multi-object smoothing that decouples data association, handled by a Deep Data Associator (DDA), from trajectory smoothing, handled by a Deep Smoother (DS). The DDA predicts a soft association matrix $A\in\mathbb{R}^{n\times B}$ over measurements and tracks, which is used to partition the measurement sequence into per-track inputs for the DS module. For each track, the DS outputs $(\hat{\boldsymbol x}_{1:T}, p_{1:T}, \bar p)$: a state trajectory, per-time-step existence probabilities, and a global trajectory existence probability. Training uses two dedicated losses: a Deep Data Associator Loss that aligns predictions to ground-truth associations via a permutation-invariant assignment, and a Deep Smoother Loss that maximizes the likelihood of ground-truth trajectories under a multi-Bernoulli density. Across ten tasks with varying clutter and detection probability, D3AS generally outperforms the model-based TPMBM, particularly in challenging scenarios where data association is hard. Decoupling the two modules also makes the model more interpretable and speeds up convergence. The results support the potential of transformer-based smoothing in low-dimensional measurement regimes and provide, to the best of the authors' knowledge, the first comparison against traditional Bayesian trackers in the smoothing setting.
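The partitioning step described above can be illustrated with a minimal sketch. It assumes that each measurement is assigned to the column of $A$ with the highest score (a hard argmax); the summary does not say how the paper's partitioning turns the soft matrix into groups, so this rule, the function name `partition_measurements`, and the toy data are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def partition_measurements(measurements, assoc):
    """Split measurements into per-track groups from a soft association matrix.

    measurements: array of shape (n, d), one row per measurement.
    assoc: array of shape (n, B), row i holding the association scores of
        measurement i over the B columns of the association matrix.

    Assumption for this sketch: a measurement goes to its highest-scoring
    column (hard argmax). The actual partitioning rule may differ.
    """
    measurements = np.asarray(measurements, dtype=float)
    assoc = np.asarray(assoc, dtype=float)
    if assoc.ndim != 2 or assoc.shape[0] != measurements.shape[0]:
        raise ValueError("assoc must have shape (n, B) with n = number of measurements")

    # Index of the best-scoring column for every measurement.
    labels = assoc.argmax(axis=1)

    # One array of measurements per column; a column may end up empty.
    return [measurements[labels == b] for b in range(assoc.shape[1])]


# Hypothetical toy scene: 4 measurements (time, position), 2 candidate tracks.
z = np.array([[0.0, 1.0], [1.0, 1.1], [0.0, 5.0], [1.0, 5.2]])
A = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])
parts = partition_measurements(z, A)
```

Each element of `parts` would then be passed on its own to the smoothing stage, which is what allows the two modules to be trained and inspected separately.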

Abstract

Multi-object tracking (MOT) is the task of estimating the state trajectories of an unknown and time-varying number of objects over a certain time window. Several algorithms have been proposed to tackle the multi-object smoothing task, where object detections can be conditioned on all the measurements in the time window. However, the best-performing methods suffer from intractable computational complexity and require approximations, performing suboptimally in complex settings. Deep learning (DL) based algorithms are a possible avenue for tackling this issue but have not been applied extensively in settings where accurate multi-object models are available and measurements are low-dimensional. We propose a novel DL architecture specifically tailored for this setting that decouples the data association task from the smoothing task. We compare the performance of the proposed smoother to the state-of-the-art in different tasks of varying difficulty and provide, to the best of our knowledge, the first comparison between traditional Bayesian trackers and DL trackers in the smoothing problem setting.

Paper Structure (24 sections, 22 equations, 7 figures, 3 tables)

BACKGROUND ON TRANSFORMERS
Multihead Self-attention
Transformer Encoder
METHOD
Overview of DDA and DS modules
Deep Data Association Module
Partitioning
Deep Smoother Module
Losses
Deep Data Associator Loss
Deep Smoother Loss
EVALUATION SETTING
Task Description
Implementation Details
D3AS
...and 9 more sections

Figures (7)

Figure 1: The structure of the transformer encoder proposed in DETR, where $N$ encoder blocks are connected in series. The components of each encoder block are shown on the right.
Figure 2: Overview of the DDA+DS method. A scene containing a sequence of measurements is processed by the DDA, which produces a data association matrix. This is then used for partitioning (represented by the box P) the measurement sequence, and each partition is individually fed to the deep smoother module for predicting a track.
Figure 3: Structure of the Deep Data Associator module. A sequence of measurements $\mathbf z_{1:n}$ is processed by a transformer encoder that uses temporal encodings, followed by an individual application of FFN and Softmax layers to the computed embedding for each measurement.
Figure 4: Structure of the Deep Smoother module. A sequence of measurements from a track is processed into a state trajectory $\hat{\mathbf x}_{1:T}$, existence probabilities $p_{1:T}$, and a global trajectory existence probability $\bar{p}$.
Figure 5: Data association loss $\mathcal{L}_\text{DDA}$ during training of the DDA module. Lower values indicate better data association performance.
...and 2 more figures
