Synthetic data generation for system identification: leveraging knowledge transfer from similar systems

Dario Piga, Matteo Rufolo, Gabriele Maroni, Manas Mejari, Marco Forgione

TL;DR

This paper tackles overfitting in dynamical system identification under data scarcity by generating synthetic data with a pre-trained Transformer meta-model that encodes knowledge of a broad class of systems. The real training data is supplied to the meta-model as context to elicit synthetic output sequences for new input sequences, and the estimation loss is augmented with a synthetic-data term weighted by a scalar $\gamma$. A hold-out validation set is used to tune $\gamma$ and can also be used for early stopping. In a Wiener-Hammerstein numerical example, incorporating synthetic data improves generalization, raising the median test $R^2$ from 0.889 to 0.956. The work highlights knowledge transfer across similar systems as a practical means to mitigate data bottlenecks and suggests future directions in uncertainty weighting and Bayesian integration.

Abstract

This paper addresses the challenge of overfitting in the learning of dynamical systems by introducing a novel approach for the generation of synthetic data, aimed at enhancing model generalization and robustness in scenarios characterized by data scarcity. Central to the proposed methodology is the concept of knowledge transfer from systems within the same class. Specifically, synthetic data is generated through a pre-trained meta-model that describes a broad class of systems to which the system of interest is assumed to belong. Training data serves a dual purpose: first, as input to the pre-trained meta-model to discern the system's dynamics, enabling the prediction of its behavior and thereby the generation of synthetic output sequences for new input sequences; second, in conjunction with the synthetic data, to define the loss function used for model estimation. A validation dataset is used to tune a scalar hyper-parameter balancing the relative importance of training and synthetic data in the definition of the loss function. The same validation set can also be used for other purposes, such as early stopping during training, which is fundamental to avoiding overfitting when the training dataset is small. The efficacy of the approach is shown through a numerical example that highlights the advantages of integrating synthetic data into the system identification process.

Paper Structure (10 sections, 7 equations, 4 figures)

This paper contains 10 sections, 7 equations, 4 figures.

Figures (4)

  • Figure 1: Encoder-decoder Transformer model for the class of systems, used to generate synthetic data. The Transformer is characterized by: number of layers ($n_{\text{layers}}$), model dimensionality per layer ($d_{\text{model}}$), number of attention heads ($n_{\text{heads}}$), and context window length ($m$).
  • Figure 2: The Wiener-Hammerstein system structure.
  • Figure 3: Impact of the regularization hyperparameter $\gamma$ on the mean squared error on the training (left panel) and validation (right panel) datasets. Boxplots of the average squared error over $100$ Monte Carlo runs. In the right panel, the vertical limit is set to $4$ for better visualization of the boxplots associated with $\gamma \neq 0$.
  • Figure 4: Impact of synthetic data on $R^2$ performance on the test dataset. Boxplots of $R^2$ coefficients over $100$ Monte Carlo runs: without synthetic data (left) and with synthetic data (right).