DeepKalPose: An Enhanced Deep-Learning Kalman Filter for Temporally Consistent Monocular Vehicle Pose Estimation
Leandro Di Bella, Yangxintong Lyu, Adrian Munteanu
TL;DR
DeepKalPose tackles temporal instability in monocular vehicle pose estimation with a bi-directional, offline Kalman smoothing framework built around a learnable motion model. An encoder–decoder Future State Predictor and separate SFEM/STFEM feature extractors capture both the immediate state and spatio-temporal patterns, with sequences processed in chunks of length $\mathcal{T}$. The method replaces the traditional Kalman gain computation with a learned RNN, and a Conditional Output Block biases the output toward higher-quality, near-camera samples, which improves robustness to occlusion and distance. On KITTI, DeepKalPose outperforms state-of-the-art baselines in both pose accuracy and temporal consistency, with notable gains in ARED, in rotation accuracy, and on far and occluded vehicles. The method is currently offline; online extensions are planned.
Abstract
This paper presents DeepKalPose, a novel approach that uses a deep-learning-based Kalman filter to enhance the temporal consistency of monocular vehicle pose estimation in video. By combining a bi-directional Kalman filter strategy, which processes the time series both forward and backward, with a learnable motion model that represents complex motion patterns, our method significantly improves pose accuracy and robustness across various conditions, particularly for occluded or distant vehicles. Experimental validation on the KITTI dataset confirms that DeepKalPose outperforms existing methods in both pose accuracy and temporal consistency.
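To make the forward and backward processing concrete, the following is a minimal sketch of a classical two-pass Kalman smoother on a scalar signal. It is not the DeepKalPose implementation: the random-walk motion model, the analytic Kalman gain, the inverse-variance fusion rule, and all names (`kalman_pass`, `bidirectional_smooth`, `q`, `r`) are illustrative assumptions, whereas the paper learns both the motion model and the gain and does not specify its fusion rule in this section.

```python
def kalman_pass(zs, q, r, x0=0.0, p0=1e6):
    """Scalar random-walk Kalman filter (illustrative, not the learned model).

    Returns (priors, posteriors), each a list of (mean, variance) per step.
    q is the assumed process-noise variance, r the measurement-noise variance.
    """
    x, p = x0, p0
    priors, posteriors = [], []
    for z in zs:
        # Predict: under a random walk the mean is unchanged, variance grows by q.
        p = p + q
        priors.append((x, p))
        # Update with the classical (analytic) Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        posteriors.append((x, p))
    return priors, posteriors


def bidirectional_smooth(zs, q=0.01, r=1.0):
    """Fuse a forward pass and a backward pass over the same measurements."""
    # Forward pass: the posterior at t uses measurements 0..t.
    _, fwd_post = kalman_pass(zs, q, r)
    # Backward pass on the reversed sequence: after re-reversing, the prior
    # at t uses only measurements t+1..end, so no measurement is counted twice.
    bwd_prior, _ = kalman_pass(zs[::-1], q, r)
    bwd_prior = bwd_prior[::-1]

    fused = []
    for (xf, pf), (xb, pb) in zip(fwd_post, bwd_prior):
        # Inverse-variance weighting: the more certain estimate gets more weight.
        w = pb / (pf + pb)
        fused.append((w * xf + (1.0 - w) * xb, pf * pb / (pf + pb)))
    return fused


if __name__ == "__main__":
    noisy_depth = [1.0, 1.2, 0.8, 1.1, 0.9]  # made-up measurements
    for mean, var in bidirectional_smooth(noisy_depth):
        print(f"mean={mean:.3f} var={var:.3f}")
```

Because the fused variance is `pf * pb / (pf + pb)`, it is never larger than the forward-only variance. This is the usual motivation for offline, bi-directional smoothing over a purely causal filter.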
