Beyond Kalman Filters: Deep Learning-Based Filters for Improved Object Tracking

Momir Adžemović, Predrag Tadić, Andrija Petrović, Mladen Nikolić

TL;DR

This paper tackles the limitations of Kalman filters in tracking-by-detection, especially under nonlinear motion, by introducing two data-driven filtering paradigms: a Bayesian filter with a trainable nonlinear motion model and an end-to-end learnable filter. It explores multiple deep motion-model architectures (AR-RNN, RNN-CNP, ACNP, and RNN-ODE) and presents MoveSORT, a SORT-like tracker with a novel hybrid association cost that leverages these filters. Experiments across DanceTrack, SportsMOT, MOT17, MOT20, and LaSOT show that the proposed filters consistently outperform KF-based baselines, with end-to-end filters offering the strongest robustness to detector noise and missing detections, particularly in nonlinear motion regimes. The work enables more reliable object tracking in challenging scenarios and provides a flexible framework that can replace conventional filters in tracking systems without bespoke per-detector tuning.

Abstract

Traditional tracking-by-detection systems typically employ Kalman filters (KF) for state estimation. However, the KF requires domain-specific design choices and is ill-suited to handling non-linear motion patterns. To address these limitations, we propose two innovative data-driven filtering methods. Our first method employs a Bayesian filter with a trainable motion model to predict an object's future location and combines its predictions with observations gained from an object detector to enhance bounding box prediction accuracy. Moreover, it dispenses with most domain-specific design choices characteristic of the KF. The second method, an end-to-end trainable filter, goes a step further by learning to correct detector errors, further minimizing the need for domain expertise. Additionally, we introduce a range of motion model architectures based on Recurrent Neural Networks, Neural Ordinary Differential Equations, and Conditional Neural Processes, which are combined with the proposed filtering methods. Our extensive evaluation across multiple datasets demonstrates that our proposed filters outperform the traditional KF in object tracking, especially in the case of non-linear motion patterns, the use case our filters are best suited to. We also conduct a noise robustness analysis of our filters with convincing positive results. We further propose a new cost function for associating observations with tracks. Our tracker, which incorporates this new association cost with our proposed filters, outperforms the conventional SORT method and other motion-based trackers in multi-object tracking according to multiple metrics on the motion-rich DanceTrack and SportsMOT datasets.

Paper Structure (35 sections, 23 equations, 6 figures, 16 tables)

Figures (6)

  • Figure 1: Visualization of a Bayesian filter incorporating non-linear, deep learning-based motion models. The measurement buffer comprises detection measurements $\{\bm{x}_1, \bm{x}_2, \dots, \bm{x}_n\}$ at times $\{t_1, t_2, \dots, t_n\}$, representing the object's trajectory history. These measurements, along with the target time $t_{n+1}$ serve as input to the motion model. The motion model predicts a prior normal distribution $\mathcal{N}(\hat{\bm{\mu}}_{n+1}, \hat{\bm{\Sigma}}_{n+1})$, which is used by the tracker in the association step. The matched detection $\bm{x}_{n+1}$, accompanied by a heuristic-based measurement noise matrix $\bm{R}_{n+1}$, represents the measurement likelihood $\mathcal{N}(\bm{x}_{n+1}, \bm{R}_{n+1})$. Finally, Bayes' rule is applied to derive the posterior estimation $\mathcal{N}(\tilde{\bm{\mu}}_{n+1}, \tilde{\bm{\Sigma}}_{n+1})$ given the prediction prior and measurement likelihood.
  • Figure 2: Overview of the two-stage end-to-end filtering process. The differences with respect to Figure 1 are as follows. First, there is no requirement for providing a manually tuned measurement noise matrix $\bm{R}_{n+1}$. Second, a neural network implicitly learns to filter out measurement noise, replacing the application of Bayes' rule.
  • Figure 3: Visualization of the NODEFilter architecture. The latent state $\bm{z}_{n}$ summarizes the trajectory up to time $t_{n}$. Initially, $\tilde{\bm{z}}_{0}$, a zero vector, represents an empty trajectory. The ODESolver is employed to extrapolate—or predict—the latent trajectory from $\tilde{\bm{z}}_{0}$ to $\hat{\bm{z}}_{1}$. Subsequently, the mean $\hat{\bm{\mu}}_{1}$ and covariance $\hat{\bm{\Sigma}}_{1}$ are derived from the latent representation $\hat{\bm{z}}_{1}$. Upon observing the first measurement $\bm{x}_{1}$, a GRU network produces the updated latent state $\tilde{\bm{z}}_{1}$ by filtering out the measurement noise, resulting in final estimates of $\tilde{\bm{\mu}}_{1}$ and $\tilde{\bm{\Sigma}}_{1}$. This procedure is iteratively repeated. In scenarios where a measurement is missing, the update step involving the GRU is omitted.
  • Figure 4: Evaluation of accuracy for all filters on the LaSOT validation dataset for different Gaussian noise standard deviations (left: prior, right: posterior). For exact metric values, refer to the LaSOT filter robustness appendix (Gaussian noise tables for the prior and the posterior).
  • Figure 5: Evaluation of accuracy on the LaSOT validation dataset for different false negative probabilities (left: prior, right: posterior). For exact metric values, refer to the LaSOT filter robustness appendix (false negative tables for the prior and the posterior).
  • ...and 1 more figure
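
The Bayes step described in the Figure 1 caption can be illustrated with a short sketch. It is not the authors' implementation. It assumes the detection observes the state directly (no separate observation matrix), so the posterior is the normalized product of the prior $\mathcal{N}(\hat{\bm{\mu}}_{n+1}, \hat{\bm{\Sigma}}_{n+1})$ and the likelihood $\mathcal{N}(\bm{x}_{n+1}, \bm{R}_{n+1})$. The function name `gaussian_posterior`, the `[x, y, w, h]` box layout, and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_posterior(mu_prior, sigma_prior, x, r):
    """Fuse a Gaussian prior N(mu_prior, sigma_prior) with a Gaussian
    measurement likelihood N(x, r), assuming the measurement observes
    the state directly (an assumption of this sketch)."""
    # Gain K = Sigma (Sigma + R)^{-1}, computed with a linear solve
    # instead of an explicit matrix inverse.
    s = sigma_prior + r
    gain = np.linalg.solve(s.T, sigma_prior.T).T
    # Posterior mean moves from the prediction toward the detection.
    mu_post = mu_prior + gain @ (x - mu_prior)
    # Posterior covariance shrinks relative to the prior.
    sigma_post = (np.eye(len(mu_prior)) - gain) @ sigma_prior
    return mu_post, sigma_post


# Hypothetical bounding box state [x, y, w, h] predicted by a motion model.
mu_hat = np.array([100.0, 50.0, 40.0, 80.0])
sigma_hat = np.diag([4.0, 4.0, 1.0, 1.0])

# Hypothetical matched detection with a heuristic noise matrix R.
x_det = np.array([104.0, 52.0, 42.0, 82.0])
r_det = np.diag([4.0, 4.0, 1.0, 1.0])

mu_post, sigma_post = gaussian_posterior(mu_hat, sigma_hat, x_det, r_det)
print(mu_post)
print(np.diag(sigma_post))
```

Because the prior and measurement covariances are equal in this toy setup, the posterior mean lies halfway between the prediction and the detection, and the posterior variances are half the prior variances. A larger $\bm{R}_{n+1}$ would pull the estimate back toward the motion model's prediction.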