
Accelerating evolutionary exploration through language model-based transfer learning

Maximilian Reissmann, Yuan Fang, Andrew S. H. Ooi, Richard D. Sandberg

TL;DR

This work tackles symbolic regression via Gene Expression Programming (GEP) by introducing a language-model–based transfer-learning framework to create a biased, knowledge-informed starting population. An encoder–decoder transformer, built on a small, efficient architecture, learns from source-task representations of prior equations (tokenized as Karva strings) and informs the target task through a latent vector that guides GEP initialization and early exploration. Empirical results on eight UCI datasets and a CFD flow problem show that transferring a modest fraction of start-population building blocks (e.g., 25%) can improve early fitness and speed up convergence in several cases, though higher transfer can be detrimental when task similarity is limited. The approach reduces search-time overhead in symbolic regression and suggests promising extensions to other combinatorial or boolean domains, with practical implications for faster, more scalable evolutionary search.
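The mixing of transferred and random individuals described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function set, the terminals, the head length, and the `sample_from_model` callback (a stand-in for sampling a gene from the trained language model) are all assumptions made for the example.

```python
import random

# Illustrative symbol sets (assumptions, not taken from the paper).
FUNCTIONS = {"+": 2, "-": 2, "*": 2, "/": 2}  # symbol -> arity
TERMINALS = ["x", "c"]


def random_gene(head_len, rng):
    """Random Karva gene: the head may hold any symbol, the tail only terminals."""
    max_arity = max(FUNCTIONS.values())
    tail_len = head_len * (max_arity - 1) + 1  # standard GEP tail-length rule
    head = [rng.choice(list(FUNCTIONS) + TERMINALS) for _ in range(head_len)]
    tail = [rng.choice(TERMINALS) for _ in range(tail_len)]
    return head + tail


def initial_population(size, transfer_fraction, sample_from_model,
                       head_len=5, seed=0):
    """Build a start population with a given share of model-sampled genes."""
    rng = random.Random(seed)
    n_transfer = int(round(size * transfer_fraction))
    transferred = [sample_from_model(rng) for _ in range(n_transfer)]
    randoms = [random_gene(head_len, rng) for _ in range(size - n_transfer)]
    return transferred + randoms


def stub_sampler(rng):
    """Hypothetical stand-in for the language model: always returns one gene."""
    return list("+*xxc" + "xxxxxx")  # head of 5 symbols, tail of 6 symbols


population = initial_population(100, 0.25, stub_sampler)
print(len(population), sum(g == stub_sampler(None) for g in population[:25]))
```

With `transfer_fraction=0.25`, the first quarter of the population comes from the sampler and the rest is generated randomly, which corresponds to the "modest fraction" setting the summary highlights.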

Abstract

Gene expression programming is an evolutionary optimization algorithm with the potential to generate interpretable and easily implementable equations for regression problems. Despite knowledge gained from previous optimizations being potentially available, the initial candidate solutions are typically generated randomly at the beginning and often only include features or terms based on preliminary user assumptions. This random initial guess, which lacks constraints on the search space, typically results in higher computational costs in the search for an optimal solution. Meanwhile, transfer learning, a technique to reuse parts of trained models, has been successfully applied to neural networks. However, no generalized strategy for its use exists for symbolic regression in the context of evolutionary algorithms. In this work, we propose an approach for integrating transfer learning with gene expression programming applied to symbolic regression. The constructed framework integrates Natural Language Processing techniques to discern correlations and recurring patterns from equations explored during previous optimizations. This integration facilitates the transfer of acquired knowledge from similar tasks to new ones. Through empirical evaluation of the extended framework across a range of univariate problems from an open database and from the field of computational fluid dynamics, our results affirm that initial solutions derived via a transfer learning mechanism enhance the algorithm's convergence rate towards improved solutions.

Paper Structure

This paper contains 16 sections, 16 equations, 6 figures, 3 tables, 2 algorithms.

Figures (6)

  • Figure 1: Interaction between the evolutionary method Gene Expression Programming (GEP) and a language model in the context of transfer learning. The process has five steps. First, GEP is used to approximate a function for a source task (1). The language model is then trained on the combination of a representation and the approximation (2). When applied to a subsequent problem (3)-(4), a proportion of the start population is generated using the case-related representation, followed by an optimization (5).
  • Figure 2: Elements of the tokenization for an example genotype encoded by the gene expression programming method. Row a) shows the original encoding of two genes, with the individual parts labeled (Head, Tail). Rows b) and c) additionally show the tokens introduced and the sequence formed from these elements. The legend gives the meaning of the individual tokens.
  • Figure 3: Boxplot of the distribution of the minimum error over the 25 runs after the initial generation. The conventional application (GEP - sampled 0) is compared with the proposed method for different proportions of augmented individuals $(0.10,0.25,0.50,0.75)$ on four cases from the UCI database. A lower fitness indicates a better approximation.
  • Figure 4: Average MAE of the fittest individuals for 25 different trials over 100 generations; each plot shows the result for one case. The markers indicate the proportion of sampled individuals, from zero (standard GEP) through $0.1,0.25,0.5$ to $0.75$.
  • Figure 5: Fitness of the best individual, with its standard deviation, across 10 different trials over 30 generations for the two target tasks within the flow scenario. The methods tested are the conventional procedure (GEP: 'o'), a technique that replicates the best subtrees from the source task (CGEP, 'x'), and the novel approach discussed here (IGEP: '-'). The target tasks are: a) a change in velocity, and b) a change in the spatial domain.
  • ...and 1 more figures
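
Figure 2 refers to the head/tail structure of a GEP gene. As a minimal sketch of how such a Karva string maps to an expression, the example below decodes a gene in breadth-first (level) order, which is the standard reading of Karva notation. The symbol set and the `karva_to_infix` helper are assumptions for illustration; the additional tokens the paper introduces for the language model are not reproduced here.

```python
from collections import deque

# Illustrative arities (assumption); "Q" stands for a unary function such as sqrt.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "Q": 1}


def karva_to_infix(gene):
    """Decode a Karva string level by level and render it as an infix string."""
    symbols = list(gene)
    root = (symbols[0], [])
    queue = deque([root])
    pos = 1  # index of the next unread symbol
    while queue:
        symbol, children = queue.popleft()
        # Terminals have arity 0, so they consume no further symbols.
        for _ in range(ARITY.get(symbol, 0)):
            child = (symbols[pos], [])
            pos += 1
            children.append(child)
            queue.append(child)

    def render(node):
        symbol, children = node
        if not children:
            return symbol
        if len(children) == 1:
            return f"{symbol}({render(children[0])})"
        return f"({render(children[0])} {symbol} {render(children[1])})"

    return render(root)


# Head "+*c" followed by tail "xxxx"; the last two tail symbols are non-coding.
print(karva_to_infix("+*cxxxx"))  # prints ((x * x) + c)
```

Symbols left unread after the tree is complete form the non-coding region of the gene, which is why a fixed-length gene can encode expressions of different sizes.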