Black-Box Combinatorial Optimization with Order-Invariant Reinforcement Learning

Olivier Goudet, Quentin Suire, Adrien Goëffon, Frédéric Saubion, Sylvain Lamprier

TL;DR

The paper tackles discrete black-box optimization by introducing an order-invariant reinforcement learning framework for Estimation-of-Distribution Algorithms (EDAs). By training neural autoregressive generators with randomly sampled generation orders and training orders, the method explores robustly and reduces reliance on a fixed dependency structure. A backbone based on Proximal Policy Optimization with scale-invariant GRPO advantages stabilizes updates, while a rank-based advantage makes the updates invariant to monotone transformations of the objective. Empirical results across QUBO, NK, and NK3 benchmarks show strong scalability and competitive performance, particularly on larger instances, indicating practical impact for high-dimensional discrete optimization tasks. Overall, the approach combines permutation-invariant modeling, structural regularization, and robust RL updates to advance sample-efficient, black-box combinatorial optimization.
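To make the two invariance properties mentioned in the summary concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names `standardized_advantages` and `rank_advantages` are hypothetical, the standardization follows the common GRPO-style recipe (subtract the group mean, divide by the group standard deviation), and the rank formula is only one plausible choice of rank-based advantage.

```python
import statistics

def standardized_advantages(scores, eps=1e-8):
    """Group-standardized advantage: (score - group mean) / group std.

    Assumed GRPO-style form; unchanged (up to eps) when all scores are
    multiplied by a positive constant or shifted by a constant.
    """
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]

def rank_advantages(scores):
    """Rank-based advantage in [-0.5, 0.5] (hypothetical formula).

    Depends only on the ordering of the scores, so any strictly increasing
    transform of the objective leaves it unchanged. Requires at least two
    scores; ties are broken by position.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    adv = [0.0] * n
    for rank, i in enumerate(order):
        adv[i] = rank / (n - 1) - 0.5
    return adv

# A group of objective values for four sampled solutions (made-up numbers).
scores = [1.0, 4.0, 2.0, 8.0]
scaled = [10.0 * s + 3.0 for s in scores]  # positive affine transform
cubed = [s ** 3 for s in scores]           # monotone but non-affine transform

print(standardized_advantages(scores))
print(standardized_advantages(scaled))     # matches the line above up to eps
print(rank_advantages(scores))
print(rank_advantages(cubed))              # identical to the line above
```

The standardized advantage survives rescaling of the objective but not a nonlinear monotone transform such as cubing, whereas the rank-based variant survives both. This is the distinction the summary draws between "scale-invariant" and "monotone-transform invariant".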

Abstract

We introduce an order-invariant reinforcement learning framework for black-box combinatorial optimization. Classical estimation-of-distribution algorithms (EDAs) often rely on learning explicit variable dependency graphs, which can be costly and fail to capture complex interactions efficiently. In contrast, we parameterize a multivariate autoregressive generative model trained without a fixed variable ordering. By sampling random generation orders during training (a form of information-preserving dropout), the model is encouraged to be invariant to variable order. This promotes search-space diversity and shapes the model to focus on the most relevant variable dependencies, which improves sample efficiency. We adapt Generalized Reinforcement Policy Optimization (GRPO) to this setting, providing stable policy-gradient updates from scale-invariant advantages. Across a wide range of benchmark algorithms and problem instances of varying sizes, our method frequently achieves the best performance and consistently avoids catastrophic failures.
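The idea of generating a solution under a randomly sampled variable order can be sketched as follows. This is a toy illustration only: `toy_conditional` is a hypothetical stand-in for the neural generator (a logistic model over the already-assigned variables), the `weights` matrix is random, and only the generation order is randomized here, not the separate training order mentioned in the summary.

```python
import math
import random

def toy_conditional(i, assigned, weights):
    """Hypothetical stand-in for the generator: P(x_i = 1 | assigned variables).

    Variables not yet assigned are masked, i.e. they contribute nothing.
    """
    logit = sum(weights[i][j] * (2 * v - 1) for j, v in assigned.items())
    return 1.0 / (1.0 + math.exp(-logit))

def sample_with_random_order(n, weights, rng):
    """Sample a binary vector by assigning variables in a random order sigma."""
    sigma = rng.sample(range(n), n)  # random permutation of 0..n-1
    assigned = {}
    log_prob = 0.0
    for i in sigma:
        p = toy_conditional(i, assigned, weights)
        x_i = 1 if rng.random() < p else 0
        log_prob += math.log(p if x_i == 1 else 1.0 - p)
        assigned[i] = x_i
    x = [assigned[i] for i in range(n)]  # back to the natural index order
    return x, sigma, log_prob

rng = random.Random(0)
n = 6
weights = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
x, sigma, log_prob = sample_with_random_order(n, weights, rng)
print("order:", sigma)
print("sample:", x)
print("log-probability under this order:", log_prob)
```

Because each variable conditions only on the variables that precede it in `sigma`, a different order hides a different subset of inputs at each step. This is the sense in which random orders behave like a dropout that still eventually reveals every variable.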

Paper Structure

This paper contains 40 sections, 1 theorem, 39 equations, 13 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

$\mathbb{E}[\hat{L}^t_\lambda(\theta)] = \mathbb{E}_{\sigma}\, \mathbb{E}_{x\sim\pi_{\theta^t}(\cdot\mid\sigma)}\left[ w_{\theta^t,\theta}(x,\sigma)\;\mathbb{E}_{\Gamma_\lambda^t\setminus\{x\}}[A_{\Gamma_\lambda^t}(x)] + \mathrm{kl}_{\theta^t,\theta}(x,\sigma) \right]$

Figures (13)

  • Figure 1: X-axis: number of calls to the objective function. Y-axis: Evolution of average scores (a) and average distances (b) obtained by the different variants of multivariate RL EDA for 100 independent runs on instances of the NK problem with $N=256$ and $K=4$.
  • Figure 2: X-axis: number of calls to the objective function. Y-axis: Evolution of average scores.
  • Figure 3: Probability of having exactly $k$ available (non-masked) input variables during neural inference of the generation probabilities of values for any dimension. Left: input dropout without order permutations. Right: input dropout combined with order permutations.
  • Figure 4: Evolution of the average scores w.r.t. the number of calls to the objective function obtained by $(\sigma,\sigma')$-RL-EDA and the 10 best other competitors for the different types of QUBO instances with $n=128$.
  • Figure 5: Evolution of the average scores w.r.t. the number of calls to the objective function obtained by $(\sigma,\sigma')$-RL-EDA and the 10 best other competitors for the different types of NK3 instances with $n=128$.
  • ...and 8 more figures

Theorems & Definitions (2)

  • Lemma 1
  • proof