Symbolic Regression on Sparse and Noisy Data with Gaussian Processes

Junette Hsin, Shubhankar Agarwal, Adam Thorpe, Luis Sentis, David Fridovich-Keil

TL;DR

The paper tackles the challenge of deriving analytic dynamical models from sparse and noisy data, where traditional SINDy struggles due to noisy derivatives. It introduces GPSINDy, a framework that denoises state measurements with Gaussian processes to obtain smooth trajectories $\boldsymbol{X}_{GP}$ and derivatives $\dot{\boldsymbol{X}}_{GP}$, then performs sparse symbolic regression via ADMM-LASSO on a candidate function library evaluated at $\boldsymbol{X}_{GP}$ and $\boldsymbol{U}$. Kernel selection is guided by marginal likelihood across multiple kernels, ensuring the denoising matches the data structure. The method is validated on Lotka-Volterra, unicycle dynamics, and NVIDIA JetRacer hardware data, showing consistently lower coefficient estimation error and trajectory RMSE than SINDy and neural-network baselines, especially under high noise and data sparsity. This approach enables robust, interpretable dynamical models suitable for robotics and control applications when data are limited or noisy.
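A minimal sketch of the denoising step, under several assumptions that are not taken from the paper: a zero-mean GP over time with a squared-exponential kernel, a known noise standard deviation `sn`, and candidate kernels that differ only in length scale (the paper compares multiple kernels; the restriction here only keeps the example short). The smoothed derivative is obtained by differentiating the posterior mean analytically, which is one common choice and may differ from the authors' procedure. All function names and the sine test signal are illustrative.

```python
import numpy as np

def rbf(ta, tb, ell, sf=1.0):
    """Squared-exponential kernel k(ta, tb)."""
    d = ta[:, None] - tb[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def rbf_dta(ta, tb, ell, sf=1.0):
    """Derivative of the kernel with respect to its first argument."""
    d = ta[:, None] - tb[None, :]
    return -(d / ell**2) * rbf(ta, tb, ell, sf)

def gp_smooth(t, y, ell, sf=1.0, sn=0.1):
    """Posterior mean, its time derivative, and the log marginal likelihood."""
    n = len(t)
    y0 = y - y.mean()                          # center so a zero-mean prior is reasonable
    K = rbf(t, t, ell, sf) + sn**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y0))   # K^{-1} y0
    x_gp = rbf(t, t, ell, sf) @ alpha + y.mean()
    xdot_gp = rbf_dta(t, t, ell, sf) @ alpha
    lml = -0.5 * y0 @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)
    return x_gp, xdot_gp, lml

def select_and_smooth(t, y, length_scales, sn=0.1):
    """Fit every candidate and keep the one with the highest marginal likelihood."""
    fits = {ell: gp_smooth(t, y, ell, sn=sn) for ell in length_scales}
    best = max(fits, key=lambda ell: fits[ell][2])
    return best, fits[best]

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Illustrative data: a noisy sine wave with a known derivative.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2 * np.pi, 40)
x_true, xdot_true = np.sin(t), np.cos(t)
y = x_true + 0.1 * rng.standard_normal(t.size)

best_ell, (x_gp, xdot_gp, lml) = select_and_smooth(t, y, [0.3, 1.0, 2.0])
print("chosen length scale:", best_ell)
print("state RMSE, raw vs GP:", rmse(y, x_true), rmse(x_gp, x_true))
print("derivative RMSE, finite difference vs GP:",
      rmse(np.gradient(y, t), xdot_true), rmse(xdot_gp, xdot_true))
```

Comparing the finite-difference derivative of the raw measurements with the GP derivative shows why the smoothing matters: differencing amplifies measurement noise, and the regression that follows inherits that error.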

Abstract

In this paper, we address the challenge of deriving dynamical models from sparse and noisy data. High-quality data is crucial for symbolic regression algorithms; limited and noisy data can present modeling challenges. To overcome this, we combine Gaussian process regression with a sparse identification of nonlinear dynamics (SINDy) method to denoise the data and identify nonlinear dynamical equations. Our approach, GPSINDy, offers improved robustness with sparse, noisy data compared to SINDy alone. We demonstrate its effectiveness on simulation data from Lotka-Volterra and unicycle models and on hardware data from an NVIDIA JetRacer system. We show superior performance, including more than 50% improvement over SINDy and other baselines in predicting future trajectories from noise-corrupted and sparse 5 Hz data.
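The sparse identification step can be sketched as a LASSO problem solved with ADMM. The example below is an illustration under stated assumptions, not the authors' implementation: the quadratic candidate library, the Lotka-Volterra parameter values, the penalty `lam`, and the heuristic for the ADMM parameter `rho` are all hypothetical choices. The inputs `X` and `Xdot` stand in for the GP-smoothed states and derivatives; here they are noise-free samples of the assumed vector field, drawn uniformly over a box rather than along a trajectory, so that the recovered coefficients can be checked against known values.

```python
import numpy as np

def library(X):
    """Candidate functions [1, x1, x2, x1^2, x1*x2, x2^2], one row per sample."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2])

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(Theta, y, lam, rho=None, iters=5000):
    """Minimize 0.5*||Theta @ xi - y||^2 + lam*||xi||_1 with scaled-form ADMM."""
    p = Theta.shape[1]
    gram = Theta.T @ Theta
    if rho is None:
        # Heuristic (an assumption): geometric mean of the extreme eigenvalues.
        evals = np.linalg.eigvalsh(gram)
        rho = float(np.sqrt(evals[0] * evals[-1]))
    A = gram + rho * np.eye(p)
    Aty = Theta.T @ y
    z = np.zeros(p)
    u = np.zeros(p)
    for _ in range(iters):
        xi = np.linalg.solve(A, Aty + rho * (z - u))   # quadratic subproblem
        z = soft_threshold(xi + u, lam / rho)          # sparsity-inducing step
        u = u + xi - z                                 # scaled dual update
    return z

# Hypothetical Lotka-Volterra parameters:
#   x1' = a*x1 - b*x1*x2,   x2' = -c*x2 + d*x1*x2
a, b, c, d = 1.1, 0.4, 1.0, 0.4
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 4.0, size=(400, 2))
Xdot = np.column_stack([a * X[:, 0] - b * X[:, 0] * X[:, 1],
                        -c * X[:, 1] + d * X[:, 0] * X[:, 1]])

Theta = library(X)
Xi = np.column_stack([admm_lasso(Theta, Xdot[:, j], lam=0.01) for j in range(2)])

Xi_true = np.zeros((6, 2))
Xi_true[1, 0], Xi_true[4, 0] = a, -b
Xi_true[2, 1], Xi_true[4, 1] = -c, d
print(np.round(Xi, 3))
```

Each column of `Xi` holds the coefficients for one state derivative. A larger `lam` prunes more library terms but also shrinks the surviving coefficients, so in practice the penalty is tuned against held-out prediction error.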

Paper Structure (9 sections, 21 equations, 3 figures, 2 tables, 1 algorithm)

Figures (3)

  • Figure 1: GPSINDy achieves lowest error learning model coefficients for the predator-prey and unicycle models. Contrast the dynamics learned by SINDy (blue), GPSINDy (orange), and NNSINDy (green) with trials repeated over 40 seeds for each noise standard deviation $\sigma$, which varies from $0.050$ to $0.25$. The vertical axis represents the mean-squared error between the ground-truth ($\boldsymbol{\Xi}_{\text{GT}}$) and the learned coefficients ($\boldsymbol{\Xi}_{\text{Learned}}$). The ribbon indicates the standard deviation around the mean line; lower is better.
  • Figure 2: GPSINDy outperforms baselines on NVIDIA JetRacer trajectories under noisy measurements. Each subplot represents the NVIDIA JetRacer dataset at a different sampling frequency (Hz). The horizontal axis represents the noise standard deviation and the vertical axis the log root-mean-squared error (RMSE) between the predicted trajectory (from the learned system dynamics) and the ground-truth states. The ribbons indicate the upper and lower quartiles around the median $\log$ RMSE; lower overall error is better. Over 45 rollouts, although SSR Residual (blue) sometimes beats GPSINDy (orange), GPSINDy achieves the lowest RMSE at most frequencies and noise levels.
  • Figure 3: GPSINDy Trajectories Align Closely with Ground Truth for the Real JetRacer System. Contrast the Cartesian trajectories predicted from SINDy (blue) and GPSINDy (orange) with the ground truth (black) based on one rollout out of the total collected JetRacer data. For this trajectory, the RMSE norm over the $x_1$ and $x_2$ coordinates for SINDy on the testing data is $1.4\,\mathrm{m}^2$, while for GPSINDy it is reduced to $0.23\,\mathrm{m}^2$.

Theorems & Definitions (1)

  • Remark 1