Learning effective good variables from physical data

Giulio Barletta; Giovanni Trezza; Eliodoro Chiavazzo

TL;DR

The paper addresses the problem of discovering, directly from data, compact and physically meaningful sets of variables that govern a target property. It introduces two complementary machine learning pathways: a regression-based one that identifies invariant variable groups in power form (and in more general forms), and a classification-based one that optimizes mixed features to maximize class separation under multi-objective criteria. Applied to the Dittus-Boelter equation, the Gnielinski correlation, and Newton's law of universal gravitation, the methods recover invariant groups with exponents close to the theoretical values, achieve reliable regression performance, and show that a small set of optimized mixed features can sharply distinguish classes, reducing dimensionality while preserving predictive power. Potential uses include model simplification, experiment design, and efficient optimization in physics-informed data analysis. Code and data are publicly available.

Abstract

We assume that a sufficiently large database is available, where a physical property of interest and a number of associated ruling primitive variables or observables are stored. We introduce and test two machine learning approaches to discover possible groups or combinations of primitive variables: the first approach is based on regression models, whereas the second is based on classification models. The variable group (here referred to as the new effective good variable) can be considered successfully found when the physical property of interest is characterized by the following effective invariant behaviour: in the first method, invariance of the group implies invariance of the property up to a given accuracy; in the second method, upon partition of the physical property values into two or more classes, invariance of the group implies invariance of the class. For the sake of illustration, the two methods are successfully applied to two popular empirical correlations describing convective heat transfer and to Newton's law of universal gravitation.
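
The regression-based notion of invariance can be illustrated with a minimal sketch. It is not the authors' procedure: it replaces the trained regression model with the closed-form Dittus-Boelter correlation, uses finite differences in place of model gradients, and tests only a two-variable group $x_i^{\alpha_1} x_j^{\alpha_2}$. It relies on the fact that, if $f$ depends on $x_i$ and $x_j$ only through such a group, the ratio of the logarithmic derivatives $\partial \ln f / \partial \ln x_i$ and $\partial \ln f / \partial \ln x_j$ equals $\alpha_1/\alpha_2$ at every point. The function names, the sampling range, and the unit-norm normalization of the exponents are assumptions made for illustration.

```python
import numpy as np

def nusselt(x):
    """Dittus-Boelter correlation (heating form, exponent 0.4 on Pr)."""
    rho, u, D, mu, cp, k = x
    re = rho * u * D / mu   # Reynolds number
    pr = mu * cp / k        # Prandtl number
    return 0.023 * re**0.8 * pr**0.4

def log_gradient(f, x, rel_step=1e-5):
    """Central finite-difference estimate of d ln f / d ln x_i (positive inputs)."""
    x = np.asarray(x, dtype=float)
    grad = np.empty_like(x)
    for i in range(x.size):
        up, down = x.copy(), x.copy()
        up[i] *= 1.0 + rel_step
        down[i] *= 1.0 - rel_step
        grad[i] = (np.log(f(up)) - np.log(f(down))) / (np.log(up[i]) - np.log(down[i]))
    return grad

def pair_exponents(f, x, i, j):
    """Unit-norm exponents (alpha_1, alpha_2) suggested by the gradient at x."""
    g = log_gradient(f, x)
    a = np.array([g[i], g[j]])
    return a / np.linalg.norm(a)

# Hypothetical sampling of the six primitive variables (arbitrary units).
rng = np.random.default_rng(0)
points = rng.uniform(0.5, 2.0, size=(5, 6))

# Exponents for the pair (rho, u), evaluated at several points of the domain.
exponents = np.array([pair_exponents(nusselt, p, 0, 1) for p in points])
print(exponents)
```

If the rows of `exponents` agree across the sampled points, the pair behaves as a single group over that region; here density and velocity enter only through their product, so both normalized exponents should be equal. With a model trained on noisy data, the agreement would hold only up to a tolerance.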

Paper Structure (18 sections, 20 equations, 12 figures, 2 tables)


Introduction
Methods
Datasets creation
Searching for good variables by regression models
Single invariant group in power form
Multiple concurrent invariant groups in power form
Further generalization to non power forms
Regression model and procedure implementation
Searching for good variables by classification models
Numerical examples and discussion
Dittus-Boelter equation
Use of regression models
Use of classification models
Gnielinski correlation
Use of regression models
...and 3 more sections

Figures (12)

Figure 1: Overview of the protocol used to detect possible symmetries of a target property of interest with respect to its input variables, utilizing only data and ignoring the analytical functional dependence. Two distinct methodologies are presented: the former for identifying, in regression tasks, invariant groups in the form $x_i^{\alpha_1} x_j^{\alpha_2} \cdots x_m^{\alpha_p}$, among others; the latter for identifying, in classification tasks, one or several mixed features as power combinations of the input variables to achieve an optimal class separation.
Figure 2: Overview of the procedure for identifying invariant groups/sets. A regression model is trained on the physical data and used to compute the gradient of the objective function at a point $\mathbf{x}_{0}$. The matrix $B$ is constructed according to the functional structure of the investigated group/set, and its kernel $K$ is computed. Finally, the condition of invariance between gradient and kernel is coupled to the normalization conditions of the coefficients. If the resulting non-linear system is satisfied for the same coefficients over the $f(\mathbf{x})$ domain, the group/set is an intrinsic variable and $f(\mathbf{x})$ is invariant with respect to it.
Figure 3: Overview of the procedure to identify optimal mixed variables for class separation. Threshold values are chosen to divide the physical data into classes. A Pareto optimization is performed to construct a reduced set of synthetic features that simultaneously maximizes the Bhattacharyya distance between the classes and (i) minimizes the variances of the class distributions, in the one-dimensional case, or (ii) minimizes the determinants of the covariance matrices of the class distributions, in the multi-dimensional case.
Figure 4: Results of the DNN regression model for the noised Nusselt number $\overline{\mathrm{Nu}}$ in the Dittus-Boelter correlation. (a) Predictions over the testing set, and (b) corresponding loss curves for the DNN model. Model performances are shown in terms of coefficient of determination $R^{2}$, mean absolute error (MAE), and root mean squared error (RMSE).
Figure 5: One-dimensional example of classification on the Dittus-Boelter correlation. (a) PDFs over binned data of the training set for the two classes ($\overline{\mathrm{Nu}}<395$ and $\overline{\mathrm{Nu}} \geq 395$) reported against the normalized flow velocity. (b) PDFs over binned data of the training set for the two classes reported against the mixed feature $y_{1}$, constructed according to Eq. (\ref{eq16}) by choosing the point of the Pareto front with the least overlap between the two classes according to the Bhattacharyya distance, along with a GEV analytical fit of the two binnings. (c) PDFs over binned data of the testing set for the two classes reported against the same mixed feature $y_{1}$, together with the same GEV fits as in subfigure (b). The mixed variable $\tilde{y}_1$ shown here refers exclusively to this Dittus-Boelter one-dimensional optimization.
...and 7 more figures
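
The class-separation criterion described in the Figure 3 caption can be sketched in one dimension as follows. This is a simplified illustration, not the authors' implementation: it assumes Gaussian class distributions so that the Bhattacharyya distance has a closed form (the Figure 5 caption mentions GEV fits instead), it skips the Pareto optimization and simply compares two hand-picked exponent vectors, and the toy target, the median threshold, and all function names are hypothetical.

```python
import numpy as np

def bhattacharyya_gaussian(a, b):
    """Bhattacharyya distance between two samples, each modeled as a 1-D Gaussian."""
    m1, m2 = a.mean(), b.mean()
    v1, v2 = a.var(), b.var()
    mean_term = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    var_term = 0.5 * np.log((v1 + v2) / (2.0 * np.sqrt(v1 * v2)))
    return mean_term + var_term

def mixed_feature(X, alpha):
    """Power-form mixed feature y = prod_i x_i ** alpha_i for each row of X."""
    return np.prod(X ** np.asarray(alpha), axis=1)

# Toy data: two positive primitive variables and a power-law target (assumed).
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 2.0, size=(2000, 2))
target = X[:, 0] ** 0.8 * X[:, 1] ** 0.4

# Split the target into two classes at a threshold (here the median).
upper = target >= np.median(target)

def separation(alpha):
    """Bhattacharyya distance between the two classes along the mixed feature."""
    y = mixed_feature(X, alpha)
    return bhattacharyya_gaussian(y[~upper], y[upper])

d_single = separation([1.0, 0.0])   # first primitive variable alone
d_mixed = separation([0.8, 0.4])    # exponents matching the toy target
print(d_single, d_mixed)
```

A larger distance indicates less overlap between the class distributions along the chosen feature. In a full treatment, the exponents would be searched rather than fixed, and the distance would be traded off against the class variances as a second objective.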
