Deep-ELA: Deep Exploratory Landscape Analysis with Self-Supervised Pretrained Transformers for Single- and Multi-Objective Continuous Optimization Problems

Moritz Vinzent Seiler; Pascal Kerschke; Heike Trautmann

TL;DR

This work addresses the limitations of classical Exploratory Landscape Analysis (ELA)—notably feature correlations and limited multi-objective applicability—by marrying ELA with self-supervised deep learning. It introduces Deep-ELA, a set of four pre-trained transformer backbones trained on millions of randomly generated single- and multi-objective optimization problems, producing invariant, low-correlation landscape features in $[-1,1]$ that can be used out-of-the-box or fine-tuned for downstream tasks. Through experiments on high-level property prediction and automated algorithm selection (both single- and multi-objective), Deep-ELA demonstrates competitive or superior performance to feature-based and feature-free baselines, particularly in data-constrained scenarios. The approach offers a scalable, plug-and-play representation for landscape analysis with practical impact on algorithm selection and problem understanding across objective regimes.

Abstract

In many recent works, the potential of Exploratory Landscape Analysis (ELA) features to numerically characterize, in particular, single-objective continuous optimization problems has been demonstrated. These numerical features provide the input for all kinds of machine learning tasks on continuous optimization problems, ranging, i.a., from High-level Property Prediction to Automated Algorithm Selection and Automated Algorithm Configuration. Without ELA features, analyzing and understanding the characteristics of single-objective continuous optimization problems is -- to the best of our knowledge -- very limited. Yet, despite their usefulness, as demonstrated in several past works, ELA features suffer from several drawbacks. These include, in particular, (1.) a strong correlation between multiple features, as well as (2.) their very limited applicability to multi-objective continuous optimization problems. As a remedy, recent works proposed deep learning-based approaches as alternatives to ELA. In these works, e.g., point-cloud transformers were used to characterize an optimization problem's fitness landscape. However, these approaches require a large amount of labeled training data. Within this work, we propose a hybrid approach, Deep-ELA, which combines (the benefits of) deep learning and ELA features. Specifically, we pre-trained four transformers on millions of randomly generated optimization problems to learn deep representations of the landscapes of continuous single- and multi-objective optimization problems. Our proposed framework can either be used out-of-the-box for analyzing single- and multi-objective continuous optimization problems, or subsequently fine-tuned to various tasks focussing on algorithm behavior and problem understanding.

Paper Structure (20 sections, 13 equations, 7 figures, 5 tables)


Introduction and Related Work
Background
Exploratory Landscape Analysis
Deep Learning-Based Approaches as Alternative to ELA
Deep Exploratory Landscape Analysis
Outline of Deep ELA
Model Structure
$k$NN Embedding
Input Processing
Embedding
Contrastive Loss
Datasets
Black-Box Optimization Problems
Random Optimization Problems
Experiments
...and 5 more sections

Figures (7)

Figure 1: Comparison of the feature-based (left) versus feature-free (right) approach on one common downstream task: Automated Algorithm Selection (AAS). In the realm of AAS, there is no single universally superior algorithm for all problem instances. Instead, AAS uses a portfolio of algorithms $\mathcal{A} = \{A_1, \ldots, A_{n_\mathcal{A}} \}$. The optimal selector is formally defined as $S\colon\mathcal{I}\rightarrow\mathcal{A}$. Typically, the selector $S$ is trained using machine learning to optimize a given performance metric. However, standard machine learners in AAS cannot process raw problem instances directly, necessitating a transformation into numerical vectors. This transformation is given as $F\colon \mathcal{I} \to \mathcal{F} \subseteq \mathbb{R}^{n_\mathcal{F}}$, where $F$ is a mapper converting an instance $I\in\mathcal{I}$ into a real-valued vector, termed (instance) features, in the feature space $\mathcal{F}$. Therefore, in a standard AAS scenario, the selector is defined as $\dot S\colon\mathcal{F}\rightarrow\mathcal{A}$, accepting features rather than actual instances.
Figure 2: Comparison of exemplary correlation matrices of (Deep-)ELA features on BBOB: (a) classical ELA features (here: meta-model, dispersion and $y$-distribution), and (b) Deep-ELA features (Large-$50d$). Correlations are calculated across all 24 functions of the BBOB suite, but individually for every instance $1$ to $20$ and dimension $2,3,5,10$. Afterward, the 80 correlation maps are mean-aggregated.
Figure 3: Signal-to-Noise Ratio (SNR) of ELA features (left boxplot) on BBOB compared to the SNR values of features from four Deep-ELA models. We used $\text{SNR} = \mu^2 / \sigma^2$ with mean $\mu$ and standard deviation $\sigma$. For $\sigma\simeq0$, values are imputed with $10^{12}$, which is the highest observed value. Higher values indicate lower noise. SNR values are calculated per feature based on instances $1$ to $20$ and then mean-aggregated over the $24$ functions and four dimensions ($2,3,5,10$). Notches show the 95% confidence intervals around the median. The large models yield the highest SNR while the medium models yield the lowest, which is to be expected as the large models contain more parameters to create more sophisticated features. Classical ELA features are somewhat 'in-between', while also including multiple features with very low SNR values.
Figure 4: Illustration of the chosen topology of the backbone model without the training heads. The model receives $(\mathcal{X},\mathcal{Y})$ as input and outputs $\mathcal{F}_\text{D-ELA}$ -- and, optionally, the embedding of the tokens $\mathcal{T}_\text{Final}$ after the final LayerNorm. $\mathcal{T}_\text{Final}$ is only relevant for the contrastive loss during training and is ignored after training. The initial $k$NN embedding [seiler2022collection] is used to capture the local information of all points from the input sample, and is followed by a stride operator to optionally reduce the number of tokens without losing information. Next, the model consists of six Multi-Head Attention blocks, each followed by a Feed-Forward block of two successive Linear layers. The LayerNorm layers are placed after the shortcuts as proposed by [nguyen2019transformers]. We chose GLU activations in the Feed-Forward layers with a $4\times$ larger number of hidden neurons. The last Linear + GLU layer projects the high-dimensional embeddings into lower dimensions. Afterward, the mean over all tokens is computed and normalized into $[-1,1]$ by a Tanh activation.
Figure 5: The two training heads on top of the backbone model. Note that the first two layers before the student's head are part of the backbone model (from BB.). Both heads are removed after training. The student's head gets updated by gradient descent while the momentum head is an old version of the student's head and gets updated through an exponential moving average (EMA). The design closely follows the idea of [chen2021mocov3].
...and 2 more figures
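
The SNR definition in the caption of Figure 3 can be illustrated with a short NumPy sketch. It assumes features are stacked as an array of shape (instances, features), uses the population standard deviation, and treats any $\sigma$ below a small tolerance as $\sigma\simeq0$. The function name `snr_per_feature`, the tolerance `tol`, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def snr_per_feature(features, impute_value=1e12, tol=1e-12):
    """Return SNR = mu^2 / sigma^2 for each feature column.

    features: array-like of shape (n_instances, n_features).
    Columns with sigma <= tol (assumed threshold) get impute_value,
    mirroring the imputation with 10^12 described in the caption.
    """
    features = np.asarray(features, dtype=float)
    mu = features.mean(axis=0)      # mean per feature
    sigma = features.std(axis=0)    # population std per feature (assumption)
    snr = np.full(mu.shape, impute_value, dtype=float)
    ok = sigma > tol                # features with non-negligible spread
    snr[ok] = mu[ok] ** 2 / sigma[ok] ** 2
    return snr

# Two instances, two features: the second feature is constant across instances.
example = snr_per_feature([[1.0, 2.0], [3.0, 2.0]])
```

In this toy input the first feature has mean 2 and standard deviation 1, while the second has zero spread and therefore receives the imputed value.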
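
The final stage described in the caption of Figure 4 (Linear + GLU projection, mean over all tokens, Tanh) might look roughly as follows in NumPy. This is only a sketch of that last step: the names `glu` and `pool_features`, the weight shapes, and the random inputs are hypothetical, and the standard GLU form $a \cdot \sigma(b)$ over two halves of the projection is assumed.

```python
import numpy as np

def glu(x):
    """Gated Linear Unit: split the last axis in half, gate one half by the other."""
    a, b = np.split(x, 2, axis=-1)
    return a / (1.0 + np.exp(-b))   # a * sigmoid(b)

def pool_features(tokens, weight, bias):
    """Project tokens, average over tokens, squash into [-1, 1].

    tokens: (n_tokens, d_model); weight: (d_model, 2 * n_features);
    bias: (2 * n_features,). All shapes are illustrative assumptions.
    """
    projected = glu(tokens @ weight + bias)   # (n_tokens, n_features)
    return np.tanh(projected.mean(axis=0))    # (n_features,)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))             # 16 tokens, model width 8
weight = rng.normal(size=(8, 2 * 4))          # project to 4 output features
bias = np.zeros(2 * 4)
features = pool_features(tokens, weight, bias)
```

Because the mean is taken over the token axis, the output length does not depend on how many tokens the input sample produced, and the Tanh guarantees values in $[-1,1]$.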
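
The caption of Figure 5 states that the momentum head is updated through an exponential moving average (EMA) of the student's head. A minimal sketch of such an update is given below, assuming parameters are held in a dictionary of NumPy arrays. The function name `ema_update` and the decay value are hypothetical and not taken from the paper.

```python
import numpy as np

def ema_update(momentum_params, student_params, decay=0.99):
    """Return new momentum parameters as an EMA of the student's parameters.

    new = decay * momentum + (1 - decay) * student, applied per parameter.
    The decay value is an assumed placeholder.
    """
    return {
        name: decay * momentum_params[name] + (1.0 - decay) * student_params[name]
        for name in momentum_params
    }

momentum = {"w": np.array([1.0, 1.0])}
student = {"w": np.array([0.0, 2.0])}
momentum = ema_update(momentum, student, decay=0.9)
```

With a decay close to 1, the momentum head changes slowly and lags behind the student's head, which matches the caption's description of it as an old version of the student's head.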
