Direct Preference-Based Evolutionary Multi-Objective Optimization with Dueling Bandit

Tian Huang, Shengbo Wang, Ke Li

TL;DR

The paper tackles interactive multi-objective optimization by learning user preferences directly from human feedback, bypassing explicit fitness models. It introduces D-PBEMO, a two-module framework: a consultation module built on clustering-based stochastic dueling bandits, and a density-ratio based preference-elicitation module that yields a probabilistic guide $\widetilde{\Pr}(\mathbf{x})$ to steer MOEAs such as NSGA-II and MOEA/D. A regret bound of $\mathcal{O}(K^2 \log T)$ is established for the clustering-based consultation component, complemented by a KL-divergence termination criterion that limits the effort required of the decision maker (DM). Empirical results on 33 synthetic benchmarks, plus RNA inverse design and protein structure prediction (PSP), show performance competitive with state-of-the-art PBEMO algorithms. The improvements become more pronounced as the number of objectives grows, while the DM workload remains manageable.
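The KL-divergence termination criterion mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the preference distribution learned in each consultation session is summarized as a Gaussian (mean and covariance), and that consultation stops once two successive estimates differ by less than a threshold. The function names, the direction of the divergence, and the threshold value are hypothetical and are not taken from the paper.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL divergence KL(N(mu0, cov0) || N(mu1, cov1)) between two Gaussians."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0, cov1 = np.asarray(cov0, float), np.asarray(cov1, float)
    k = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    # slogdet is numerically safer than log(det(...))
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (
        np.trace(cov1_inv @ cov0)
        + diff @ cov1_inv @ diff
        - k
        + logdet1
        - logdet0
    )

def should_stop_consulting(prev, curr, threshold=1e-3):
    """Hypothetical stopping rule: stop when successive estimates barely change.

    prev and curr are (mean, covariance) pairs from consecutive sessions.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    return bool(gaussian_kl(*prev, *curr) < threshold)
```

Under these assumptions, a DM would only be queried again while successive estimates of the preference distribution still move appreciably, which is one way such a criterion could keep the number of consultations bounded.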

Abstract

Optimization problems find widespread use in both single-objective and multi-objective scenarios. In practical applications, users aspire for solutions that converge to the region of interest (ROI) along the Pareto front (PF). While the conventional approach involves approximating a fitness function or an objective function to reflect user preferences, this paper explores an alternative avenue. Specifically, we aim to discover a method that sidesteps the need for calculating the fitness function, relying solely on human feedback. Our proposed approach entails conducting direct preference learning facilitated by an active dueling bandit algorithm. The experimental phase is structured into three sessions. Firstly, we assess the performance of our active dueling bandit algorithm. Secondly, we implement our proposed method within the context of Multi-objective Evolutionary Algorithms (MOEAs). Finally, we deploy our method in a practical problem, specifically in protein structure prediction (PSP). This research presents a novel interactive preference-based MOEA framework that not only addresses the limitations of traditional techniques but also unveils new possibilities for optimization problems.
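To make the idea of learning from pairwise human feedback concrete, the following is a minimal sketch of a generic dueling-bandit loop. It is not the paper's clustering-based algorithm: it samples pairs uniformly at random, simulates the decision maker with a hypothetical preference matrix `pref` (where `pref[i, j]` is the probability that candidate `i` is preferred over candidate `j`), and picks the candidate that wins the most pairwise comparisons (a Copeland-style winner). All names and the uniform sampling strategy are assumptions made for illustration.

```python
import numpy as np

def run_duels(pref, n_rounds, seed=0):
    """Simulate pairwise comparisons; wins[i, j] counts how often i beat j."""
    rng = np.random.default_rng(seed)
    k = pref.shape[0]
    wins = np.zeros((k, k))
    for _ in range(n_rounds):
        # Uniform pair selection (an active algorithm would choose pairs adaptively)
        i, j = rng.choice(k, size=2, replace=False)
        if rng.random() < pref[i, j]:
            wins[i, j] += 1
        else:
            wins[j, i] += 1
    return wins

def copeland_winner(wins):
    """Return the arm that beats the most other arms in empirical majority."""
    total = wins + wins.T
    # Pairs that were never compared default to a neutral 0.5
    p_hat = np.where(total > 0, wins / np.maximum(total, 1), 0.5)
    scores = (p_hat > 0.5).sum(axis=1)
    return int(np.argmax(scores))
```

In an interactive setting, the simulated comparison inside `run_duels` would be replaced by an actual query to the user, and no fitness function is evaluated at any point; only the outcomes of duels are recorded.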

Paper Structure

This paper contains 52 sections, 2 theorems, 31 equations, 16 figures, 27 tables, 2 algorithms.

Key Result

Theorem 3.3

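Under the two assumptions the paper labels assumption:tieprobability and assumption:tight, for any $\epsilon \in (0,1]$ and $\alpha>0.5$, the regret of our clustering-based stochastic dueling bandits algorithm is bounded by a term of order $\mathcal{O}(K^2 \log T)$ (the order reported in the TL;DR; the exact expression is not reproduced on this page).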
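The exact expression of the bound is not reproduced here. As a hedged illustration of the quantity such a bound controls, the sketch below computes cumulative regret under a definition commonly used in the dueling-bandit literature: the per-round regret of dueling arms $i$ and $j$ is the average of how much the best arm is preferred over each of them beyond $0.5$. The paper's own regret definition may differ; the function name and arguments are hypothetical.

```python
import numpy as np

def cumulative_regret(pref, duels, best):
    """Cumulative regret of a sequence of duels.

    pref[i, j] is the probability that arm i beats arm j,
    duels is a list of (i, j) pairs played in order,
    best is the index of the optimal arm (assumed known for evaluation).
    """
    pref = np.asarray(pref, float)
    gaps = pref[best] - 0.5  # gaps[i] = P(best beats i) - 0.5
    per_round = [(gaps[i] + gaps[j]) / 2 for i, j in duels]
    return np.cumsum(per_round)
```

Under this definition, a round in which the best arm duels itself incurs zero regret, so a bound that grows only logarithmically in $T$ means the algorithm spends most rounds on near-optimal pairs.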

Figures (16)

  • Figure 1: (a) Flow chart of a conventional PBEMO. (b) Conceptual illustration of reward-based, model-based, and direct preference learning strategies.
  • Figure 2: (a) The evolutionary population of an EMO algorithm is divided into three subsets, where $\tilde{\mathcal{S}}^{2}$ covers the SOI (denoted as a $\star$). (b) After a PBEMO round, in the next consultation session, all solutions are steered towards the SOI and their spread tightens around it.
  • Figure 3: The density ratio between $p_\nu(\tilde{\mathbf{x}})$ and $p_\ell(\tilde{\mathbf{x}})$ is shaded in blue, while its estimation is shaded in red. The SOI falls within the estimated Gaussian distribution for $95\%$ confidence interval.
  • Figure 4: Box plot of the Scott-Knott test ranks achieved by D-PBEMO and peer algorithms on $33$ test problems, each run $20$ times. The algorithms are indexed as follows: 1 $\leadsto$ D-PBNSGA-II, 2 $\leadsto$ D-PBMOEA/D, 3 $\leadsto$ I-MOEA/D-PLVF, 4 $\leadsto$ I-NSGA-II/LTR, 5 $\leadsto$ IEMO/D.
  • Figure 5: Comparison of D-PBNSGA-II against the other three state-of-the-art PBEMO algorithms on a selected RNA inverse design task (Eterna ID: $852950$). The target structure is shaded in blue, while the predicted structures obtained by the different optimization algorithms are highlighted in red. In this experiment, the preference is set to $\sigma=1$. The closer $\sigma$ is to $1$, the better the performance of the corresponding algorithm. When algorithms share the same largest $\sigma$, a smaller $MFE$ indicates better performance. Full results can be found in the appendix (label app:science).
  • ...and 11 more figures

Theorems & Definitions (10)

  • Definition 2.1: UrvoyCFN13
  • Definition 2.2
  • Remark 1
  • Remark 2
  • Theorem 3.3
  • Remark 3
  • Theorem 3.4
  • Proof: Proof of Theorem 3.3
  • Remark 4
  • Proof: Proof of Theorem 3.4