
Adaptive Learned State Estimation based on KalmanNet

Arian Mehrfard, Bharanidhar Duraisamy, Stefan Haag, Florian Geiss, Mirko Mählisch

Abstract

Hybrid state estimators that combine model-based Kalman filtering with learned components have shown promise on simulated data, yet their performance on real-world automotive data remains insufficient. In this work we present Adaptive Multi-modal KalmanNet (AM-KNet), an extension of KalmanNet tailored to the multi-sensor autonomous driving setting. AM-KNet introduces sensor-specific measurement modules that enable the network to learn the distinct noise characteristics of radar, lidar, and camera independently. A hypernetwork with context modulation conditions the filter on target type, motion state, and relative pose, allowing adaptation to diverse traffic scenarios. We further incorporate a covariance estimation branch based on Joseph's form and supervise it through negative log-likelihood losses on both the estimation error and the innovation. A comprehensive, component-wise loss function encodes physical priors on sensor reliability, target class, motion state, and measurement flow consistency. AM-KNet is trained and evaluated on the nuScenes and View-of-Delft datasets. The results demonstrate improved estimation accuracy and tracking stability compared to the base KalmanNet, narrowing the performance gap with classical Bayesian filters on real-world automotive data.
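For concreteness, the Joseph-form update and the two NLL terms referenced above take the following standard shapes. The notation here is ours, not the paper's: $H_k$ denotes the measurement model, $\nu_k = y_k - \hat{y}_{k|k-1}$ the innovation, and placing the learned term $B_k$ where the measurement-noise covariance conventionally sits is an assumption read off the architecture description.

$$P_{k|k} = \left(I - W_k H_k\right) P_{k|k-1} \left(I - W_k H_k\right)^{\top} + W_k B_k W_k^{\top}$$

$$\mathcal{L}_{\mathrm{NLL}}^{x} = \tfrac{1}{2}\left[\left(x_k - \hat{x}_{k|k}\right)^{\top} P_{k|k}^{-1} \left(x_k - \hat{x}_{k|k}\right) + \log\det P_{k|k}\right], \qquad \mathcal{L}_{\mathrm{NLL}}^{\nu} = \tfrac{1}{2}\left[\nu_k^{\top} S_k^{-1} \nu_k + \log\det S_k\right]$$

Both are Gaussian negative log-likelihoods up to additive constants: the first supervises the estimation-error covariance $P_{k|k}$, the second the innovation covariance $S_k$.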


Figures (4)

  • Figure 1: Architecture of AM-KNet. The four input features and the motion state encoding are expanded through fully connected layers (FC5, FC6, FC8) and concatenated into the shared backbone GRUs (GRU Q, GRU P). For each measurement component, a sensor-specific head (FC1, FC7, GRU S) extracts per-component features, which are refined by FC2 and FC3 and aggregated by the Kalman Gain module. The S Decoder produces the innovation covariance $S_k$ via Cholesky factorisation. FC4 combines the fused covariance with the GRU P output and feeds back into GRU P as a recurrent connection; its output is also passed to the B Decoder, which produces the learned noise term $B_k$ for the Joseph-form covariance update. The hypernetwork (bottom) maps the 27-dimensional context vector, comprising a target encoding and an object type encoding, through two FC layers and separate gain and shift heads to produce an affine modulation of the pre-activations throughout the network. Modules enclosed by the grey dashed box are instantiated per sensor modality; all other modules are shared. (A minimal code sketch of this context modulation follows the figure list.)
  • Figure 2: Centroid error distributions of camera- and lidar-detected objects on the nuScenes training set. The left side shows the error distributions and the right side the total count of detections. In this visualization the forward direction of the ego vehicle is along the positive x-axis.
  • Figure 3: Overview of the AM-KNet training framework. Measurements from lidar, camera, and radar enter the state estimation system at timestep $k$. The prior state $\hat{x}_{k|k-1}$ is obtained through the prediction step and used to compute the predicted measurement, innovation, and the four input features. For each available sensor, AM-KNet receives the features along with the motion state encoding, sensor identifier, and context vector, and outputs the Kalman gain $W_k$, innovation covariance $S_k$, and learned noise term $B_k$. The state is updated via the Kalman gain and the covariance via Joseph's form. After processing all sensors sequentially, the posterior $\hat{x}_{k|k}$ and $P_{k|k}$ are fed back for the next timestep. Training is supervised by a weighted component-wise MSE loss and two Gaussian NLL terms on the estimation error and innovation, introduced in a staged schedule. (A pseudocode sketch of this sequential per-sensor update follows the figure list.)
  • Figure 4: A sample from the VoD dataset showing the lidar (blue) and radar (purple) point clouds, with ground truths in brown and estimates in green. The reference points of ground truths and estimates are visualized as diamonds. The ego vehicle faces the top of the plot, and the vehicle's y-coordinate is plotted along the horizontal axis.
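As a rough illustration of the hypernetwork in Figure 1, the sketch below implements a FiLM-style affine modulation of pre-activations. Only the 27-dimensional context size and the two-FC-layer trunk with separate gain and shift heads follow the caption; the name `ContextHypernetwork`, the hidden width, the modulated layer sizes, and the `(1 + gain)` centring are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ContextHypernetwork(nn.Module):
    """Hypothetical FiLM-style hypernetwork: maps the context vector to
    per-layer gain/shift pairs that modulate pre-activations."""

    def __init__(self, context_dim=27, hidden_dim=64, modulated_dims=(32, 32)):
        super().__init__()
        # Two FC layers, as in the Figure 1 description.
        self.trunk = nn.Sequential(
            nn.Linear(context_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        total = sum(modulated_dims)
        self.gain_head = nn.Linear(hidden_dim, total)    # separate gain head
        self.shift_head = nn.Linear(hidden_dim, total)   # separate shift head
        self.splits = list(modulated_dims)

    def forward(self, context):
        h = self.trunk(context)
        # One (gain, shift) pair per modulated layer of the main network.
        gains = torch.split(self.gain_head(h), self.splits, dim=-1)
        shifts = torch.split(self.shift_head(h), self.splits, dim=-1)
        return gains, shifts


def modulate(z, gain, shift):
    # Affine modulation of a pre-activation z, applied before the
    # nonlinearity; (1 + gain) makes a zero hypernetwork output the
    # identity -- a common FiLM convention, assumed here.
    return (1.0 + gain) * z + shift
```

Each modulated layer of the main network would then compute, e.g., `h = relu(modulate(fc(x), gain_i, shift_i))`, with the gains and shifts recomputed whenever the context vector changes.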
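Figure 3 also pins down the per-timestep control flow: sensors are fused sequentially and the posterior is fed back as the next prior. Below is a minimal pseudocode sketch of that loop under the simplifying assumption of a linear measurement model `H` shared across sensors; `am_knet` is a placeholder for the learned network and is assumed to return the gain, innovation covariance, and learned noise term, with $B_k$ substituted into the Joseph-form update as in the equations after the abstract.

```python
import numpy as np


def filter_step(x_prior, P_prior, measurements, am_knet, H, context):
    """One timestep of the Figure 3 loop: fuse each available sensor in
    turn, then return the posterior for the next timestep (a sketch)."""
    x, P = x_prior.copy(), P_prior.copy()
    I = np.eye(P.shape[0])
    for sensor_id, y in measurements.items():   # lidar, camera, radar
        nu = y - H @ x                          # innovation vs. predicted measurement
        # Learned terms: Kalman gain W, innovation covariance S (used by
        # the NLL loss during training), learned noise term B.
        W, S, B = am_knet(nu, x, P, sensor_id, context)
        x = x + W @ nu                          # state update via Kalman gain
        A = I - W @ H
        P = A @ P @ A.T + W @ B @ W.T           # Joseph-form covariance update
    return x, P
```

Processing sensors one at a time keeps each update a standard single-measurement Kalman step, which matches the sensor-specific measurement modules described in the abstract.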