ASAP: Exploiting the Satisficing Generalization Edge in Neural Combinatorial Optimization

Han Fang, Paul Weng, Yutong Ban

TL;DR

This work addresses cross-distribution generalization in neural combinatorial optimization (CO) by uncovering the Satisficing Generalization Edge: identifying a set of promising actions generalizes better than selecting the single optimal action. It introduces Adaptive Selection After Proposal (ASAP), a two-stage framework that decouples proposal generation from final selection, trained with a two-phase regime enhanced by MAML to enable rapid online adaptation. Theoretical analysis and experiments on 3D Bin Packing, TSP, and CVRP show that ASAP improves out-of-distribution generalization and accelerates adaptation with minimal inference overhead. The results suggest a general and practical paradigm for deploying neural solvers in dynamic CO settings, one that may apply beyond the tested domains.

Abstract

Deep Reinforcement Learning (DRL) has emerged as a promising approach for solving Combinatorial Optimization (CO) problems, such as the 3D Bin Packing Problem (3D-BPP), Traveling Salesman Problem (TSP), or Vehicle Routing Problem (VRP), but these neural solvers often exhibit brittleness when facing distribution shifts. To address this issue, we uncover the Satisficing Generalization Edge, which we validate both theoretically and experimentally: identifying a set of promising actions is inherently more generalizable than selecting the single optimal action. To exploit this property, we propose Adaptive Selection After Proposal (ASAP), a generic framework that decomposes the decision-making process into two distinct phases: a proposal policy that acts as a robust filter, and a selection policy as an adaptable decision maker. This architecture enables a highly effective online adaptation strategy where the selection policy can be rapidly fine-tuned on a new distribution. Concretely, we introduce a two-phase training framework enhanced by Model-Agnostic Meta-Learning (MAML) to prime the model for fast adaptation. Extensive experiments on 3D-BPP, TSP, and CVRP demonstrate that ASAP improves the generalization capability of state-of-the-art baselines and achieves superior online adaptation on out-of-distribution instances.
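
The proposal-then-selection decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`propose`, `select`, `two_stage_act`), the use of plain score lists in place of neural policies, and the top-k proposal rule are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def propose(proposal_scores, k):
    """Stage 1 (hypothetical proposal policy): keep the k highest-scoring actions."""
    ranked = sorted(range(len(proposal_scores)),
                    key=lambda i: proposal_scores[i], reverse=True)
    return ranked[:k]


def select(selection_scores, candidates, rng):
    """Stage 2 (hypothetical selection policy): sample among the candidates only."""
    probs = softmax([selection_scores[i] for i in candidates])
    return rng.choices(candidates, weights=probs, k=1)[0]


def two_stage_act(proposal_scores, selection_scores, k, rng):
    """Run both stages and return the chosen action with its candidate set."""
    candidates = propose(proposal_scores, k)
    return select(selection_scores, candidates, rng), candidates


rng = random.Random(0)
action, candidates = two_stage_act(
    proposal_scores=[0.1, 2.0, 1.5, -1.0, 0.7],
    selection_scores=[0.0, 0.3, 1.2, 5.0, 0.4],
    k=3,
    rng=rng,
)
print(candidates, action)
```

In this sketch, action 3 has the highest selection score but can never be chosen, because the proposal stage filters it out. The selection stage only has to discriminate among the few candidates that survive.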

Paper Structure

This paper contains 62 sections, 4 theorems, 26 equations, 10 figures, 11 tables, 1 algorithm.

Key Result

Theorem 4.2

Given a policy $\pi$ assigning probability $p_{t_1}$ to the optimal action, the probability of selecting the optimal action via the Two-Stage process ($\mathbb P^{two}$) strictly exceeds the probability via the One-Stage process ($\mathbb P^{one}$), i.e., $\mathbb P^{two} > \mathbb P^{one}$, if:

Figures (10)

  • Figure 1: Trained on fixed distributions, ASAP can quickly adapt to other distributions for Combinatorial Optimization problems.
  • Figure 2: Preliminary results (see \ref{sec:setup} for dataset description) indicate the factor leading to the generalization gap. (a) Comparison of Optimal Frequencies and Policy-Induced Frequencies vs. Rank of Choices. The left figure shows the results on the Default (training) dataset, while the right figure displays the results on the ID-Small dataset. (b) Inclusion rate of the top-1 action (induced by MCTS) for different proposal action set sizes on cross-distribution datasets.
  • Figure 3: Overview of the ASAP architecture. (A) Inference Flow: The decision process is decoupled into a Proposal Policy ($\pi^p$) that generates a candidate subset, and a Selection Policy ($\pi^s$) that picks the final action. (B) Two-Phase Training: To ensure convergence, we first pretrain a base model (Phase 1) to initialize parameters, followed by cooperative tuning (Phase 2) where both policies interact. (C) Adaptation: On new distributions, the proposal policy is frozen to maintain general candidate quality, while the selection policy is quickly fine-tuned to fit the specific domain.
  • Figure 4: Demonstration of MCTS experiments.
  • Figure 5: Full Preliminary Results for Distribution Mismatch in Discrete Environment.
  • ...and 5 more figures
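
The adaptation step in Figure 3 (C), where the proposal policy is frozen and only the selection policy is fine-tuned, can be sketched on a toy one-step problem. This is a minimal illustration under stated assumptions, not the paper's training procedure: the policies are replaced by per-action score lists, the proposal rule is a plain top-k, and the update is the exact policy gradient of a softmax bandit rather than MAML-primed reinforcement learning. The name `adapt_selection` and all parameters are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def adapt_selection(proposal_scores, selection_logits, rewards, k,
                    lr=0.5, steps=200):
    """Fine-tune only the selection logits on rewards from a new distribution.

    The proposal scores are read but never modified (the frozen stage).
    Each step applies the exact gradient of the expected reward of a
    softmax policy restricted to the candidate set.
    """
    candidates = top_k(proposal_scores, k)   # frozen proposal stage
    logits = list(selection_logits)          # work on a copy
    for _ in range(steps):
        probs = softmax([logits[i] for i in candidates])
        baseline = sum(p * rewards[i] for p, i in zip(probs, candidates))
        for p, i in zip(probs, candidates):
            # d E[r] / d logit_i = p_i * (r_i - E[r])
            logits[i] += lr * p * (rewards[i] - baseline)
    return candidates, logits


proposal_scores = [2.0, 1.5, 1.0, 0.1]       # learned on the training distribution
new_rewards = [0.2, 0.9, 0.5, 1.0]           # hypothetical shifted distribution
candidates, logits = adapt_selection(proposal_scores, [0.0] * 4, new_rewards, k=3)
probs = softmax([logits[i] for i in candidates])
print(candidates, [round(p, 3) for p in probs])
```

Only the logits of the three proposed actions change. The selection policy concentrates on action 1, the best action inside the candidate set, while action 3 stays out of reach because the frozen proposal stage excludes it. The toy therefore also shows the cost of a proposal set that misses the best action.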

Theorems & Definitions (5)

  • Definition 4.1: Two-Stage Decision Process
  • Theorem 4.2: Superiority of Two-Stage Decision Process
  • Theorem 4.3: Robustness of Proposal Set Inclusion
  • Theorem 2.1: Superiority of Two-Stage Decision Process
  • Theorem 2.2: Robustness of Proposal Set Inclusion
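
The quantity behind Figure 2 (b) and the proposal-set inclusion results is the rate at which a reference action falls inside a policy's top-k set. One possible way to measure it is sketched below. The function name `top_k_inclusion_rate` and the toy scores are illustrative assumptions; in the paper the reference action is induced by MCTS, whereas here it is simply given as an index.

```python
def top_k_inclusion_rate(score_rows, reference_actions, k):
    """Fraction of states whose reference action is among the k top-scored actions.

    score_rows: one list of policy scores per state.
    reference_actions: index of the reference (e.g. search-derived) action per state.
    """
    hits = 0
    for scores, ref in zip(score_rows, reference_actions):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if ref in ranked[:k]:
            hits += 1
    return hits / len(score_rows)


# Toy policy scores for four states with three actions each.
score_rows = [
    [0.9, 0.1, 0.0],
    [0.4, 0.5, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
reference_actions = [0, 0, 1, 2]

for k in (1, 2, 3):
    print(k, top_k_inclusion_rate(score_rows, reference_actions, k))
# prints 0.25 for k=1, 0.75 for k=2, 1.0 for k=3
```

The rate cannot decrease as k grows, since each top-k set contains the previous one. In this toy data the policy picks the reference action as its single best choice in only one of four states, yet places it in the top two in three of four.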