DeepKalPose: An Enhanced Deep-Learning Kalman Filter for Temporally Consistent Monocular Vehicle Pose Estimation

Leandro Di Bella, Yangxintong Lyu, Adrian Munteanu

TL;DR

DeepKalPose tackles temporal instability in monocular vehicle pose estimation by introducing a bi-directional, offline Kalman smoothing framework with a learnable motion model. It employs an encoder–decoder Future State Predictor and separate SFEM/STFEM feature extractors to capture both immediate state and spatio-temporal patterns, processed in chunks of length $\mathcal{T}$. The method replaces traditional KF gains with a learned RNN and uses a Conditional Output Block to bias toward higher-quality, near-camera samples, improving robustness to occlusion and distance. On KITTI, DeepKalPose outperforms state-of-the-art baselines in pose accuracy and temporal consistency, with notable improvements in ARED, rotation accuracy, and far- and occluded-vehicle scenarios, albeit as an offline method with planned online extensions.

Abstract

This paper presents DeepKalPose, a novel approach for enhancing temporal consistency in monocular vehicle pose estimation applied on video through a deep-learning-based Kalman Filter. By integrating a Bi-directional Kalman filter strategy utilizing forward and backward time-series processing, combined with a learnable motion model to represent complex motion patterns, our method significantly improves pose accuracy and robustness across various conditions, particularly for occluded or distant vehicles. Experimental validation on the KITTI dataset confirms that DeepKalPose outperforms existing methods in both pose accuracy and temporal consistency.
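To make the forward and backward processing idea concrete, here is a minimal sketch using a classical scalar Kalman filter. It is not the authors' implementation. The random-walk motion model, the hand-set noise variances `q` and `r`, the function names, and the simple averaging of the two passes are all assumptions made for illustration. DeepKalPose instead learns the motion model and the gain with neural networks, and its fusion of the two directions is not specified in this summary.

```python
def kalman_1d(measurements, q=0.01, r=1.0):
    """Scalar Kalman filter with a random-walk motion model (assumed here).

    q: process-noise variance, r: measurement-noise variance (both hand-set).
    """
    x, p = measurements[0], r  # initialise state and variance from the first sample
    estimates = [x]
    for z in measurements[1:]:
        # Predict: the state is assumed unchanged; uncertainty grows by q.
        p = p + q
        # Update: blend prediction and measurement with the Kalman gain k.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


def bidirectional_smooth(measurements, q=0.01, r=1.0):
    """Run the filter forward and backward in time, then average the passes.

    Plain averaging is an illustrative choice, not an optimal smoother.
    """
    forward = kalman_1d(measurements, q, r)
    backward = kalman_1d(measurements[::-1], q, r)[::-1]
    return [(f + b) / 2.0 for f, b in zip(forward, backward)]


if __name__ == "__main__":
    noisy_depth = [10.0, 10.6, 9.7, 10.3, 9.9, 10.4, 9.8, 10.2]
    print(bidirectional_smooth(noisy_depth))
```

Because the backward pass sees future frames, an estimate at a frame where the vehicle is occluded can draw on cleaner observations from both sides. This is also why the approach is offline: the whole chunk must be available before smoothing.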

Paper Structure

This paper contains 6 sections, 1 equation, 4 figures, 5 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Schematic Overview of DeepKalPose.
  • Figure 2: (a) Schematic of the proposed EDLKF architecture. (b) The Spatio-Temporal Feature Extraction Module. (c) Details of the State Feature Extraction Module (top) and State Estimation Module (bottom). $\mathcal{Z}^{-1}$ is a unit delay, $\otimes$ is a multiplication operation, $\oplus$ is a sum operation, and $\ominus$ is a subtraction operation.
  • Figure 3: Qualitative results demonstrating improved car trajectory estimation for an occluded and distant vehicle. The sequence displays the projected 3D bounding boxes: the upper row shows results from Mono6D [lyu2022mono6d], while the lower row shows results from the proposed method. The two panels on the far right show the temporal Bird's Eye View (BEV) comparison between the 3D bounding box estimates of Mono6D and DeepKalPose, respectively.
  • Figure 4: ARED of Mono6D and the proposed method as a function of distance. Solid lines represent the mean values, while the shaded areas indicate the variance.