OptiState: State Estimation of Legged Robots using Gated Networks with Transformer-based Vision and Kalman Filtering

Alexander Schperberg; Yusuke Tanaka; Saviz Mowlavi; Feng Xu; Bharathan Balaji; Dennis Hong

TL;DR

OptiState addresses the challenge of accurate state estimation for dynamic legged robots by integrating a model-based Kalman filter (KF), which uses MPC-derived ground reaction forces, with a learning-assisted correction path. A GRU, fed by the KF output and a latent depth representation from a Vision Transformer, learns to compensate for nonlinearities and to provide uncertainty estimates, yielding robust trunk-state estimates across terrains. Hardware experiments show a 65% RMSE improvement over a state-of-the-art VIO SLAM baseline, highlighting the method's effectiveness and potential for real-time deployment. The work demonstrates the value of combining principled physics-based estimation with data-driven corrections and vision-informed context in legged robotics.

Abstract

State estimation for legged robots is challenging due to their highly dynamic motion and limitations imposed by sensor accuracy. By integrating Kalman filtering, optimization, and learning-based modalities, we propose a hybrid solution that combines proprioceptive and exteroceptive information for estimating the state of the robot's trunk. Leveraging joint encoder and IMU measurements, our Kalman filter is enhanced through a single-rigid-body model that incorporates ground reaction force control outputs from convex Model Predictive Control optimization. The estimate is further refined through Gated Recurrent Units, which also consider semantic insights and robot height from a Vision Transformer autoencoder applied to depth images. This framework not only furnishes accurate robot state estimates, including uncertainty evaluations, but can also minimize, through learning, the nonlinear errors that arise from sensor measurements and model simplifications. The proposed methodology is evaluated in hardware using a quadruped robot on various terrains, yielding a 65% improvement in Root Mean Squared Error (RMSE) compared to our VIO SLAM baseline. Code example: https://github.com/AlexS28/OptiState
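To make the model-based part of the pipeline concrete, the following is a minimal sketch of a linear Kalman filter predict/update cycle with a control input. It is not the authors' implementation: the 1-D vertical model, the function names `kf_predict` and `kf_update`, and all numerical values (mass, time step, noise covariances, measurement) are assumptions chosen for illustration. The control input stands in for the net acceleration produced by a ground reaction force, loosely mirroring how the abstract describes MPC forces entering a single-rigid-body model.

```python
import numpy as np

def kf_predict(x, P, A, B, u, Q):
    """Propagate the state mean and covariance through a linear model."""
    x_pred = A @ x + B @ u
    P_pred = A @ P @ A.T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    """Correct the prediction with a measurement z."""
    y = z - H @ x_pred                      # innovation
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new

# Toy vertical model (assumption): state = [height, vertical velocity].
dt = 0.01                                   # hypothetical time step [s]
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
H = np.array([[1.0, 0.0]])                  # only height is measured
Q = 1e-4 * np.eye(2)                        # hypothetical process noise
R = np.array([[1e-2]])                      # hypothetical measurement noise

mass, g = 12.0, 9.81                        # hypothetical robot mass [kg]
f_z = mass * g                              # hypothetical force balancing gravity
u = np.array([f_z / mass - g])              # net vertical acceleration

x, P = np.array([0.3, 0.0]), np.eye(2)
x_pred, P_pred = kf_predict(x, P, A, B, u, Q)
x_new, P_new = kf_update(x_pred, P_pred, np.array([0.31]), H, R)
print(x_new, np.trace(P_new))
```

In this sketch the measurement update pulls the height estimate toward the measured value and shrinks the covariance. In the framework described above, the filter output would then be passed to the learned GRU stage rather than used directly.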

Paper Structure (13 sections, 11 equations, 5 figures, 1 table)

This paper contains 13 sections, 11 equations, 5 figures, 1 table.

Introduction
Related Works
Methods
Problem Definition
Measurements
Model
Kalman Filter
Prediction
Update
Improving Estimation through Learning
Experimental Validation
Results
Limitations and Conclusion

Figures (5)

Figure 1: Top left shows the attached body frame (B) and the world frame (W). The trunk state $\mathbf{x}$ is in the world frame, while the footstep positions $\mathbf{p}$ are relative to the body. Ground reaction forces (blue arrow) from our MPC control policy are given by $\mathbf{f}$, also in the world frame. We verify our algorithm, OptiState, on slippery surfaces (top right), an incline (bottom left), and rough terrain (bottom right).
Figure 2: Overall state estimation architecture, as described in the Methods section.
Figure 3: Transformer and Gated Recurrent Unit (GRU) network architecture. In $\mathbf{(A)}$, model 1 is the transformer and model 2 is the GRU ($\delta$), which predicts the robot's trunk state and the uncertainty of its own prediction. Input/output state and hidden layer sizes are indicated by the numbers. The training loss of model 1 is shown in ($\mathbf{B}$) and that of model 2 in ($\mathbf{C}$). MSE is the loss function (see the Improving Estimation through Learning section).
Figure 4: Results during the online testing phase, as described in the Results section. In $\mathbf{(A)}$ we show the state estimates for all state components from OptiState and VIO SLAM, together with the ground truth. We show 4 distinct trajectories, connected by solid lines, to represent the various terrains under evaluation, such as flat, slippery, inclined, and rough terrain. The RMSE results over all 4 trajectories and per state component are shown in $\mathbf{(B)}$, and include OptiState without the Kalman filter input, OptiState without the vision input, and the Kalman filter alone. Lastly, $\mathbf{(C)}$ shows the percentage improvement in RMSE over the VIO SLAM baseline for each state, for each estimation algorithm listed in the first column.
Figure 5: Example of the predicted uncertainty $\mu_{x}$ of the GRU's (OptiState) $x$-position estimate. The shaded blue region represents $\bar{x}\pm\mu_{x}$.
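The RMSE and percentage-improvement quantities reported in Figure 4 (B) and (C) can be computed as in the short sketch below. The helper names `rmse` and `percent_improvement` are hypothetical, and the arrays are synthetic toy values, not data from the paper.

```python
import numpy as np

def rmse(estimate, truth):
    """Root mean squared error between an estimate and the ground truth."""
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((estimate - truth) ** 2)))

def percent_improvement(rmse_method, rmse_baseline):
    """Relative RMSE reduction of a method with respect to a baseline, in percent."""
    return 100.0 * (rmse_baseline - rmse_method) / rmse_baseline

# Synthetic toy signals (assumption): constant offsets from the ground truth.
truth = np.array([0.0, 1.0, 2.0, 3.0])
baseline_estimate = truth + 0.2   # stand-in for a baseline estimator
method_estimate = truth + 0.1     # stand-in for an improved estimator

improvement = percent_improvement(
    rmse(method_estimate, truth), rmse(baseline_estimate, truth)
)
print(f"RMSE improvement over baseline: {improvement:.1f}%")
```

A positive value means the method has a lower RMSE than the baseline. Computing the metric per state component, as in panel (B), amounts to calling `rmse` on each component's trajectory separately.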
