Direct Preference-Based Evolutionary Multi-Objective Optimization with Dueling Bandit

Tian Huang; Shengbo Wang; Ke Li

TL;DR

The paper tackles interactive multi-objective optimization by learning user preferences directly from human feedback, bypassing explicit fitness models. It introduces D-PBEMO, a two-module framework: a consultation module built on clustering-based stochastic dueling bandits, and a density-ratio-based preference-elicitation module that yields a probabilistic guide $\widetilde{\Pr}(\mathbf{x})$ to steer MOEAs such as NSGA-II and MOEA/D. A regret bound of $\mathcal{O}(K^2 \log T)$ is established for the clustering-based consultation component, complemented by a KL-divergence termination criterion that limits the effort required from the decision maker (DM). Empirical results on 33 synthetic benchmarks plus RNA inverse design and protein structure prediction (PSP) show competitive performance relative to state-of-the-art PBEMO algorithms, with notable improvements as the number of objectives grows, while the DM workload remains manageable.
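The KL-divergence termination criterion mentioned above can be illustrated with a minimal sketch. It assumes the learned preference after each consultation is summarized as a Gaussian (mean and covariance), and that consultation stops once two successive summaries are nearly identical. The function names `gaussian_kl` and `should_stop`, the threshold value, and the direction of the divergence are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL(P || Q) between two multivariate Gaussians."""
    mu_p = np.atleast_1d(np.asarray(mu_p, dtype=float))
    mu_q = np.atleast_1d(np.asarray(mu_q, dtype=float))
    cov_p = np.atleast_2d(np.asarray(cov_p, dtype=float))
    cov_q = np.atleast_2d(np.asarray(cov_q, dtype=float))
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    trace_term = np.trace(cov_q_inv @ cov_p)
    quad_term = diff @ cov_q_inv @ diff
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (trace_term + quad_term - d + logdet_q - logdet_p)

def should_stop(prev, curr, threshold=1e-3):
    """Hypothetical stopping rule: stop asking the DM once the current
    (mean, cov) summary barely differs from the previous one."""
    return gaussian_kl(curr[0], curr[1], prev[0], prev[1]) < threshold

# Two identical summaries trigger termination; a shifted mean does not.
same = (np.zeros(2), np.eye(2))
shifted = (np.array([3.0, 0.0]), np.eye(2))
print(should_stop(same, same), should_stop(same, shifted))
```

A small threshold trades extra DM queries for a more settled preference estimate; a suitable value would depend on the problem and is not specified here.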

Abstract

Optimization problems find widespread use in both single-objective and multi-objective scenarios. In practical applications, users seek solutions that converge to the region of interest (ROI) along the Pareto front (PF). While the conventional approach approximates a fitness function or an objective function to reflect user preferences, this paper explores an alternative avenue. Specifically, we aim for a method that sidesteps the need to calculate a fitness function and relies solely on human feedback. Our proposed approach performs direct preference learning, facilitated by an active dueling bandit algorithm. The experiments are organized into three parts. First, we assess the performance of our active dueling bandit algorithm. Second, we implement our proposed method within multi-objective evolutionary algorithms (MOEAs). Finally, we deploy our method on a practical problem, namely protein structure prediction (PSP). This research presents a novel interactive preference-based MOEA framework that not only addresses the limitations of traditional techniques but also opens new possibilities for optimization problems.

Paper Structure (52 sections, 2 theorems, 31 equations, 16 figures, 27 tables, 2 algorithms)

Introduction
Preliminaries
Multi-Objective Optimization Problem
Preference Learning as Dueling Bandits
Proposed Method
Consultation Module
Step $1$: Partition $\mathcal{S}$ into $K$ subsets $\{\tilde{\mathcal{S}}^i\}_{i=1}^K$ based on solution features in the context of EMO.
Step $2$: Subset-level dueling sampling and solution-level pairwise comparisons.
Step $3$: Output the learned preferences.
Preference Elicitation Module
Experiments
Experimental Setup
Benchmark problems
Performance metrics
Comparison Results with State-of-the-art PBEMO algorithms
...and 37 more sections
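Steps 1 to 3 of the Consultation Module listed above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the partition uses plain k-means with farthest-point initialization, subset pairs are drawn uniformly at random rather than by the paper's bandit sampling rule, and the decision maker is simulated by a hypothetical callback `prefers(xa, xb)`. The names `partition` and `consult` are also invented for this example.

```python
import numpy as np

def partition(points, k, iters=20):
    """Step 1 (assumed): k-means with farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def consult(points, k, budget, prefers, seed=0):
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    labels = partition(points, k)
    active = [j for j in range(k) if np.any(labels == j)]
    if len(active) < 2:
        raise ValueError("need at least two non-empty subsets to duel")
    wins = np.zeros((k, k))  # wins[a, b]: times subset a beat subset b
    for _ in range(budget):
        # Step 2: pick two subsets, then one solution from each, and ask the DM.
        a, b = rng.choice(active, size=2, replace=False)
        xa = points[rng.choice(np.flatnonzero(labels == a))]
        xb = points[rng.choice(np.flatnonzero(labels == b))]
        if prefers(xa, xb):
            wins[a, b] += 1
        else:
            wins[b, a] += 1
    # Step 3: report each subset's empirical win rate and the winner.
    duels = (wins + wins.T).sum(axis=1)
    scores = wins.sum(axis=1) / np.maximum(duels, 1)
    return int(np.argmax(scores)), labels, scores

# Toy usage: three tight groups of objective vectors; the simulated DM
# prefers whichever solution lies closer to the origin.
offsets = np.array([[-0.2, -0.2], [0.2, -0.2], [-0.2, 0.2], [0.2, 0.2]])
pts = np.vstack([c + offsets for c in np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])])
best, labels, scores = consult(pts, 3, 200, lambda a, b: np.linalg.norm(a) < np.linalg.norm(b))
print(best, scores)
```

Dueling at the subset level keeps the number of arms at $K$ rather than the population size, which is the quantity the $\mathcal{O}(K^2 \log T)$ regret bound is stated in.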

Key Result

Theorem 3.3

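Under the tie-probability and tightness assumptions (labelled `assumption:tieprobability` and `assumption:tight` in the source), for any $\epsilon \in (0,1]$ and $\alpha>0.5$, the regret of our clustering-based stochastic dueling bandits algorithm is bounded. The explicit expression is not reproduced here; the TL;DR reports the bound as $\mathcal{O}(K^2 \log T)$.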

Figures (16)

Figure 1: (a) Flow chart of a conventional PBEMO. (b) Conceptual illustration of reward-based, model-based, and direct preference learning strategies.
Figure 2: (a) The evolutionary population of an EMO algorithm is divided into three subsets, where $\tilde{\mathcal{S}}^{2}$ covers the SOI (denoted as a $\star$). (b) After a PBEMO round, in the next consultation session, all solutions are steered towards the SOI and their spread tightens around it.
Figure 3: The density ratio between $p_\nu(\tilde{\mathbf{x}})$ and $p_\ell(\tilde{\mathbf{x}})$ is shaded in blue, while its estimate is shaded in red. The SOI falls within the $95\%$ confidence interval of the estimated Gaussian distribution.
Figure 4: Box plot of the Scott-Knott test ranks achieved by D-PBEMO and peer algorithms on $33$ test problems, each run $20$ times. The algorithm indices are as follows: 1 $\leadsto$ D-PBNSGA-II, 2 $\leadsto$ D-PBMOEA/D, 3 $\leadsto$ I-MOEA/D-PLVF, 4 $\leadsto$ I-NSGA-II/LTR, 5 $\leadsto$ IEMO/D.
Figure 5: Comparison of D-PBNSGA-II against the other three state-of-the-art PBEMO algorithms on a selected RNA inverse design task (Eterna ID: $852950$). The target structure is shaded in blue, while the structures predicted by the different optimization algorithms are highlighted in red. In this experiment, the preference is set to $\sigma=1$. The closer $\sigma$ is to $1$, the better the corresponding algorithm performs. When several algorithms share the same highest $\sigma$, a smaller $MFE$ indicates better performance. Full results can be found in the appendix.
...and 11 more figures

Theorems & Definitions (10)

Definition 2.1: UrvoyCFN13
Definition 2.2
Remark 1
Remark 2
Theorem 3.3
Remark 3
Theorem 3.4
Proof of Theorem 3.3 (regret)
Remark 4
Proof of Theorem 3.4 (convergence)
