Characterising harmful data sources when constructing multi-fidelity surrogate models

Nicolau Andrés-Thió, Mario Andrés Muñoz, Kate Smith-Miles

TL;DR

The paper tackles the problem of selecting among high- and low-fidelity data sources in expensive black-box surrogate modeling, where prior benchmarks may be biased and data are limited. It employs Instance Space Analysis (ISA) to map when low-fidelity sources are harmful or beneficial and to visualize regions of guidance, supported by an unbiased benchmark suite combining literature, disturbance-based, and SOLAR-based functions. A predictive selector based on ISA achieves about 80.6% accuracy in choosing between Kriging and Co-Kriging, and simple, rule-based guidelines reach ~81.6% accuracy, offering practical, bias-free recommendations for industry. The work advances both methodological understanding and applied decision-making in multi-fidelity surrogate modeling with limited data.

Abstract

Surrogate modelling techniques have seen growing attention in recent years when applied to both modelling and optimisation of industrial design problems. These techniques are highly relevant when assessing the performance of a particular design carries a high cost, as the overall cost can be mitigated via the construction of a model to be queried in lieu of the available high-cost source. The construction of these models can sometimes employ other sources of information which are both cheaper and less accurate. The existence of these sources, however, poses the question of which sources should be used when constructing a model. Recent studies have attempted to characterise harmful data sources to guide practitioners in choosing when to ignore a certain source. These studies have done so in a synthetic setting, characterising sources using a large amount of data that is not available in practice. Some of these studies have also been shown to potentially suffer from bias in the benchmarks used in the analysis. In this study, we present a characterisation of harmful low-fidelity sources using only the limited data available to train a surrogate model. We employ recently developed benchmark filtering techniques to conduct a bias-free assessment, providing objectively varied benchmark suites of different sizes for future research. Analysing one of these benchmark suites with the technique known as Instance Space Analysis, we provide an intuitive visualisation of when a low-fidelity source should be used, and we use this analysis to provide guidelines that can be applied in an industrial setting.

Paper Structure (14 sections, 11 equations, 15 figures, 1 table)

Figures (15)

  • Figure 1: ISA framework (smith2023instance).
  • Figure 2: A short description of the features used in this study, as well as the range of feature values for each feature.
  • Figure 3: Sources of the function pairs used in each of the instances, where the dark blue points represent instances created from the SOLAR simulation, the light blue points represent classical literature instances, and the yellow points represent disturbance-based instances.
  • Figure 4: Binary performance of (a) Kriging and (b) Co-Kriging models. The blue points represent instances for which the model's performance is labelled good, and the orange points represent instances for which the performance is labelled bad.
  • Figure 5: SVM predictions of (a) Kriging and (b) Co-Kriging performance. The blue points represent instances where the model's performance is predicted to be good, and the orange points represent instances where the performance is predicted to be bad.
  • ...and 10 more figures