Symbolically Regressing Fish Biomass Spectral Data: A Linear Genetic Programming Method with Tunable Primitives

Zhixing Huang, Bing Xue, Mengjie Zhang, Jeremy S. Ronney, Keith C. Gordon, Daniel P. Killeen

TL;DR

The paper tackles the challenge of predicting fish biomass composition from spectroscopic data when the training data are limited and noisy and the resulting models must be interpretable. It introduces LGP-TP, a linear genetic programming framework with tunable primitives that jointly learns symbolic program structure and coefficients, producing compact, interpretable regression models. Empirical results across ten biomass targets show LGP-TP achieving superior or competitive predictive accuracy and robust generalization across spectral-data treatments and a symbolic-regression benchmark (SRBench). The approach highlights useful spectral features and keeps training time favorable, indicating practical applicability for fast, non-destructive biomass estimation in production settings.

Abstract

Machine learning techniques play an important role in analyzing spectral data. The spectral data of fish biomass is useful in fish production, as it carries many important chemical properties of fish meat. However, it is challenging for existing machine learning techniques to comprehensively discover hidden patterns in fish biomass spectral data, since the spectral data are often noisy while the training data are quite limited. To better analyze fish biomass spectral data, this paper models the task as a symbolic regression problem and solves it with a linear genetic programming method with newly proposed tunable primitives. In the symbolic regression problem, linear genetic programming automatically synthesizes regression models based on the given primitives and training data. The tunable primitives further improve the approximation ability of the regression models by tuning their inherent coefficients. Our empirical results over ten fish biomass targets show that the proposed method improves the overall performance of fish biomass composition prediction. The synthesized regression models are compact and have good interpretability, which allows us to highlight useful features over the spectrum. Our further investigation also verifies the good generality of the proposed method across various spectral data treatments and other symbolic regression problems.
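To make the idea of a register-based program with tunable coefficients concrete, here is a minimal Python sketch. It is not the authors' LGP-TP implementation: the instruction layout, the function set `OPS`, the names `Instruction`, `run_program`, `mse`, and `tune_coefficient`, and the use of simple hill climbing to tune a coefficient are all assumptions made for illustration. The only detail taken from the page is that a program's output is read from the first register, R0.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Sequence

# Hypothetical function set; the paper's actual primitive set is not listed here.
OPS: Dict[str, Callable[[float, float], float]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


@dataclass(frozen=True)
class Instruction:
    dest: int     # destination register index
    op: str       # key into OPS
    src: int      # source register index
    feature: int  # input feature index (e.g. one spectral bin)
    coef: float   # tunable coefficient applied to the input feature


def run_program(program: Sequence[Instruction], x: Sequence[float],
                n_registers: int = 4) -> float:
    """Execute instructions in order; the prediction is read from register R0."""
    regs = [0.0] * n_registers
    for ins in program:
        regs[ins.dest] = OPS[ins.op](regs[ins.src], ins.coef * x[ins.feature])
    return regs[0]


def mse(program: Sequence[Instruction], X: Sequence[Sequence[float]],
        y: Sequence[float]) -> float:
    """Mean squared error of the program over a data set."""
    errors = [(run_program(program, xi) - yi) ** 2 for xi, yi in zip(X, y)]
    return sum(errors) / len(errors)


def tune_coefficient(program: Sequence[Instruction], idx: int,
                     X: Sequence[Sequence[float]], y: Sequence[float],
                     step: float = 0.1, iters: int = 50) -> List[Instruction]:
    """Hill-climb one coefficient (an assumed stand-in for the paper's tuning)."""
    best = list(program)
    best_err = mse(best, X, y)
    for _ in range(iters):
        improved = False
        for delta in (step, -step):
            cand = list(best)
            cand[idx] = replace(best[idx], coef=best[idx].coef + delta)
            err = mse(cand, X, y)
            if err < best_err:
                best, best_err, improved = cand, err, True
                break
        if not improved:
            break  # neither direction lowers the error
    return best


if __name__ == "__main__":
    # Toy data generated from y = 3*x0 - x1; the program starts with coefficient 2.0.
    X = [[1.0, 2.0], [3.0, 4.0], [0.5, 1.0], [2.0, 0.0]]
    y = [3.0 * a - b for a, b in X]
    program = [
        Instruction(dest=1, op="add", src=1, feature=0, coef=2.0),
        Instruction(dest=0, op="add", src=1, feature=1, coef=-1.0),
    ]
    tuned = tune_coefficient(program, 0, X, y)
    print(mse(program, X, y), mse(tuned, X, y), tuned[0].coef)
```

In this sketch the program structure (which registers, operators, and features appear) stays fixed while a single coefficient is adjusted. In a genetic programming setting the structure would also evolve through variation and selection, which the sketch leaves out.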

Paper Structure

This paper contains 17 sections, 19 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Schematic diagram of applying LGP-TP for fish biomass prediction. Particularly, LGP-TP synthesizes regression models based on the spectral data (inputs) and the chemical ground truth of on-hand fish samples (target outputs). The synthesized regression model from LGP-TP predicts the biomass of unseen fish samples in real-world production.
  • Figure 2: Test performance of the compared methods. (a) Test R$^2$ of the compared methods over 10 six-fold cross-validations, one box plot per biomass target. (b) The radar chart shows the mean ranks of the compared methods over different biomass targets. The legend of the chart shows the overall mean ranks by Friedman's test and the statistical results by the Wilcoxon rank-sum test ($\alpha=0.05$). "win" indicates the number of targets where a compared method is significantly better than the proposed LGP-TP. "draw" indicates the number of targets without significantly different test R$^2$. "lose" indicates the number of targets where a compared method is significantly worse than LGP-TP. (c) The ablation study of LGP-TP, adding tunable primitives one by one to LGP evolution. The legend follows the same design as sub-figure b.
  • Figure 3: Interpretability analysis of LGP-TP. (a) An example regression model synthesized by LGP-TP for predicting water in fish meat, outputting results from the first register R0. (b) Input features highlighted by the resulting models. The upper part of sub-figure b is the Raman spectral data with highlighted features in yellow. The three heat maps in the lower part show the frequency of input terminals over feature ranges (X-axis: percentage over the range of wavenumber, Y-axis: tunable terminals). A dark color indicates a high frequency over the 10 six-fold cross-validations. (c) Mean program size ($\pm$ std.) of synthesized regression models over generations from 10 six-fold cross-validations. We denote the number of effective instructions as the program size.
  • Figure 4: Training and test performance of LGP-TP and the general effectiveness across spectral data treatments. (a) The training and test R$^2$ of LGP-TP in four example biomass targets. (b) Test R$^2$ (X-axis) of the four compared methods for different data treatments over 10 six-fold cross-validation. The three rows stand for three different spectral data treatments. We show the box plots of test R$^2$ on three example biomass targets and the bar chart of mean ranks (and corresponding p-values) over the ten biomass targets by Friedman's test.
  • Figure 5: Results of SRBench. (a) Comprehensive comparison of benchmark methods in terms of test R$^2$, model size, and training time. (b) Non-dominated sets over model size rank and test R$^2$ rank. (c) Training time of benchmark methods over different numbers of training instances.
  • ...and 1 more figures