Combinatorial Optimization with Policy Adaptation using Latent Space Search

Felix Chalumeau; Shikha Surana; Clement Bonnet; Nathan Grinsztajn; Arnu Pretorius; Alexandre Laterre; Thomas D. Barrett

TL;DR

COMPASS tackles NP-hard combinatorial optimization problems by learning a continuous latent space of diverse, specialized policies, each conditioned on a latent vector $\mathbf{z} \in [-1,1]^{16}$. During training, only the best latent-conditioned policy for each instance is updated via REINFORCE, which creates specialization within the latent space. At inference, COMPASS searches this space with CMA-ES (using multiple components and Voronoi initialization) to adapt rapidly to each instance without re-training. Evaluated on TSP, CVRP, and JSSP, COMPASS achieves state-of-the-art performance across 11 tasks and generalizes robustly to procedurally transformed, out-of-distribution instances, often outperforming strong baselines at lower computational cost than competitive active-search methods. The work also analyzes the latent space and search dynamics, discusses practical considerations such as code releases and runtime implications, and highlights COMPASS's potential for industrial CO tasks as well as future improvements in latent-space regularization and diversity.
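
The inference-time idea in the TL;DR can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a plain Gaussian evolution strategy rather than CMA-ES (no covariance adaptation, no multiple components, no Voronoi initialization), and `toy_cost` is a hypothetical stand-in for rolling out the policy conditioned on $\mathbf{z}$ and measuring the cost of the resulting solution. The population size, elite count, and step-size decay are likewise illustrative assumptions.

```python
import random

LATENT_DIM = 16  # latent size stated in the TL;DR


def clip(v, lo=-1.0, hi=1.0):
    """Keep a coordinate inside the latent box [-1, 1]."""
    return max(lo, min(hi, v))


def toy_cost(z):
    # Hypothetical stand-in for "condition the policy on z, build a solution,
    # return its cost". Here: a quadratic bowl centred at 0.5 in every dimension.
    return sum((zi - 0.5) ** 2 for zi in z)


def latent_search(cost_fn, budget=400, pop_size=20, elite=5, sigma=0.5, seed=0):
    """Simplified Gaussian evolution strategy over the latent space (not CMA-ES)."""
    rng = random.Random(seed)
    mean = [0.0] * LATENT_DIM
    best_z, best_cost = list(mean), cost_fn(mean)
    for _ in range(budget // pop_size):
        # Sample a population of latent vectors around the current mean.
        pop = [[clip(rng.gauss(m, sigma)) for m in mean] for _ in range(pop_size)]
        # Evaluate each candidate once and sort by cost (lower is better).
        scored = sorted(((cost_fn(z), z) for z in pop), key=lambda t: t[0])
        if scored[0][0] < best_cost:
            best_cost, best_z = scored[0]
        # Move the mean to the average of the elite candidates.
        elites = [z for _, z in scored[:elite]]
        mean = [sum(col) / elite for col in zip(*elites)]
        sigma *= 0.9  # shrink the search radius (illustrative schedule)
    return best_z, best_cost


best_z, best_cost = latent_search(toy_cost)
print(f"best cost found: {best_cost:.4f}")
```

The policy weights are never touched in this loop; only the 16-dimensional vector is searched, which is why this kind of adaptation needs no re-training.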

Abstract

Combinatorial Optimization underpins many real-world applications, and yet designing performant algorithms to solve these complex, typically NP-hard problems remains a significant research challenge. Reinforcement Learning (RL) provides a versatile framework for designing heuristics across a broad spectrum of problem domains. However, despite notable progress, RL has not yet supplanted industrial solvers as the go-to solution. Current approaches emphasize pre-training heuristics that construct solutions but often rely on search procedures with limited variance, such as stochastically sampling numerous solutions from a single policy or employing computationally expensive fine-tuning of the policy on individual problem instances. Building on the intuition that performant search at inference time should be anticipated during pre-training, we propose COMPASS, a novel RL approach that parameterizes a distribution of diverse and specialized policies conditioned on a continuous latent space. We evaluate COMPASS across three canonical problems (Travelling Salesman, Capacitated Vehicle Routing, and Job-Shop Scheduling) and demonstrate that our search strategy (i) outperforms state-of-the-art approaches on 11 standard benchmarking tasks and (ii) generalizes better, surpassing all other approaches on a set of 18 procedurally transformed instance distributions.
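
The way pre-training anticipates inference-time search, as summarized in the TL;DR and in Figure 1, is that only the best latent-conditioned policy per instance is trained. A rough sketch of that selection step follows. The uniform sampling of latent vectors, the `baseline` argument, and the function names are assumptions made for illustration; the rewards and log-probabilities would in practice come from rolling out the conditioned policy.

```python
import random

LATENT_DIM = 16  # latent size stated in the TL;DR


def sample_latents(n, rng):
    # Assumption: latent vectors are drawn uniformly from [-1, 1]^16.
    return [[rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)] for _ in range(n)]


def best_only_loss(rewards, log_probs, baseline=0.0):
    """REINFORCE-style loss for one instance, using only the best latent.

    rewards[i]   : reward of the solution built by the policy conditioned on z_i
    log_probs[i] : log-probability of that solution under the same policy
    baseline     : hypothetical variance-reduction term (an assumption here)
    """
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    advantage = rewards[best] - baseline
    # All other latent vectors contribute no gradient for this instance.
    return best, -advantage * log_probs[best]


rng = random.Random(0)
latents = sample_latents(3, rng)
# Made-up rewards and log-probabilities, for illustration only.
best, loss = best_only_loss([1.0, 3.0, 2.0], [-0.5, -0.2, -0.9])
print(best, round(loss, 6))
```

Because each instance updates only the region of the latent space that already performs best on it, different regions are pushed toward different kinds of instances.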

Paper Structure (52 sections, 1 equation, 13 figures, 8 tables)

Introduction
Related work
Construction methods for CO
Improving solutions at inference time
Methods
Preliminaries
Formulation
COMPASS
Latent space
Architecture
Training
Inference-time search
Experiments
In Distribution
Larger Instances
...and 37 more sections

Figures (13)

Figure 1: Our method COMPASS is composed of the following two phases. A. Training - the latent space is sampled to generate vectors that the policy can condition upon. The conditioned policies are then evaluated and only the best one is trained to create specialization within the latent space. B. Inference - at inference time the latent space is searched through an evolution strategy to exploit regions with high-performing policies for each instance.
Figure 2: Performance of COMPASS and the main baselines aggregated across several tasks over three problems (TSP, CVRP, and JSSP). For each task (problem type, instance size, mutation power), we normalize values between 0 and 1 (corresponding to the worst and best performing RL method, respectively). Hence, all tasks have the same impact on the aggregated metrics. COMPASS surpasses the baselines on all of them, showing its versatility for all types of tasks and in particular, its generalization capacity.
Figure 3: Relative difference between COMPASS and baselines as a function of mutation power. COMPASS outperforms the baselines on all 18 evaluation sets. Most methods have a decreasing performance ratio, showing that COMPASS generalizes better: its evolution strategy is able to find areas of its latent space that are high-performing, even on instances that are out-of-distribution.
Figure 4: Evolution of the overall performance and last performance obtained by the methods during their search on TSP150, averaged over 1000 instances. The right plot reports the mean and standard deviation of the most recent shots tried by each method during the search. It illustrates how COMPASS efficiently explores its latent space to search for high-performing solutions.
Figure 5: Contour plot of COMPASS's latent space, reflecting performance on a problem instance. White crosses show the successive means of a CMA-ES component during the search. The width of the path is proportional to the search's variance.
...and 8 more figures
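
The per-task normalization described in the Figure 2 caption (0 for the worst-performing RL method, 1 for the best) amounts to a min-max rescaling. A small sketch is given below. The function name, the method names, and the numbers are made up for illustration, and the `lower_is_better` flag assumes a cost-style metric such as tour length.

```python
def normalize_task(scores, lower_is_better=True):
    """Rescale one task's scores so the worst method maps to 0 and the best to 1."""
    values = list(scores.values())
    if lower_is_better:
        worst, best = max(values), min(values)
    else:
        worst, best = min(values), max(values)
    if best == worst:
        # All methods tie on this task; treat them all as best.
        return {name: 1.0 for name in scores}
    return {name: (worst - v) / (worst - best) for name, v in scores.items()}


# Hypothetical costs for three methods on a single task.
task_scores = {"method_a": 10.0, "method_b": 12.0, "method_c": 11.0}
print(normalize_task(task_scores))
# → {'method_a': 1.0, 'method_b': 0.0, 'method_c': 0.5}
```

Rescaling each task separately before averaging is what gives every task the same weight in the aggregate, regardless of the raw scale of its objective.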
