Reinforcement Learning for Node Selection in Branch-and-Bound

Alexander Mattick, Christopher Mutschler

TL;DR

This work tackles the problem of node selection in branch-and-bound for MILP/MINLP by introducing a tree-aware reinforcement learning framework that reasons over the entire BnB tree. A graph neural network encodes the directed search tree, and a PPO-trained policy assigns probabilities to leaves via root-to-leaf path weights, enabling learned, tree-wide decisions. Trained on synthetic TSP-MILP instances, the approach generalizes to TSPLIB, UFLP, MINLPLIB, and MIPLIB, yielding improvements in optimality gap reduction and per-node efficiency under time constraints. The study demonstrates the potential of integrating tree-structured information into learned node selectors and highlights practical gains for state-of-the-art solvers, while outlining directions for feature design and efficiency optimization.

Abstract

A big challenge in branch-and-bound lies in identifying the optimal node within the search tree from which to proceed. Current state-of-the-art selectors utilize either hand-crafted ensembles that automatically switch between naive sub-node selectors, or learned node selectors that rely on individual node data. We propose a novel simulation technique that uses reinforcement learning (RL) while considering the entire tree state, rather than just isolated nodes. To achieve this, we train a graph neural network that produces a probability distribution based on the path from the model's root to its "to-be-selected" leaves. Modelling node selection as a probability distribution allows us to train the model using state-of-the-art RL techniques that capture both intrinsic node quality and node-evaluation costs. Our method induces a high-quality node selection policy on a set of varied and complex problem sets, despite only being trained on specially designed, synthetic travelling salesman problem (TSP) instances. Using such a fixed pretrained policy shows significant improvements on several benchmarks in optimality gap reductions and per-node efficiency under strict time constraints.

Paper Structure (31 sections, 17 equations, 2 figures, 6 tables)

This paper contains 31 sections, 17 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Our method: (1) SCIP solves individual nodes and executes existing heuristics. (2) Features are extracted from every branch-and-bound node and sent to individual normalization and embedding. (3) The node embeddings are subject to $K$ steps of GNN message passing on the induced tree structure. (4) Based on the node embeddings, we generate root-to-leaf paths, from which we sample the next node. The resulting node is submitted to SCIP and we return to step 1.
  • Figure 2: Naive approach using recursive selection. The probabilities are computed based on which "fork" of the tree is traveled. Sampling can be done by choosing left or right at each fork based on $p_i$.
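
The recursive selection described in Figure 2 can be illustrated with a minimal sketch. It assumes a binary tree in which each internal node $i$ stores a probability $p_i$ of taking its left fork, so that a leaf's probability is the product of the fork probabilities along its root-to-leaf path. The names `Node`, `p_left`, `sample_leaf`, and `leaf_probabilities` are hypothetical and chosen for illustration only; they come from neither the paper nor SCIP, and the example tree and its probabilities are made up.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Toy search-tree node (hypothetical structure, for illustration only)."""
    name: str
    p_left: float = 0.5  # assumed: probability p_i of taking the left fork
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def sample_leaf(root: Node, rng: random.Random) -> Node:
    """Walk from the root, choosing left or right at each fork based on p_i."""
    node = root
    while not node.is_leaf():
        if node.right is None:      # only one child: no choice to make
            node = node.left
        elif node.left is None:
            node = node.right
        else:
            node = node.left if rng.random() < node.p_left else node.right
    return node


def leaf_probabilities(node: Node, prob: float = 1.0) -> Dict[str, float]:
    """Probability of reaching each leaf: product of fork probabilities on its path."""
    if node.is_leaf():
        return {node.name: prob}
    if node.right is None:
        return leaf_probabilities(node.left, prob)
    if node.left is None:
        return leaf_probabilities(node.right, prob)
    probs = leaf_probabilities(node.left, prob * node.p_left)
    probs.update(leaf_probabilities(node.right, prob * (1.0 - node.p_left)))
    return probs


# Made-up example: the root goes left with probability 0.7; its left child
# splits evenly between two leaves; its right child is itself a leaf.
tree = Node(
    "root",
    p_left=0.7,
    left=Node("a", p_left=0.5, left=Node("a1"), right=Node("a2")),
    right=Node("b"),
)

if __name__ == "__main__":
    print(leaf_probabilities(tree))
    print(sample_leaf(tree, random.Random(0)).name)
```

In this toy tree, leaves `a1` and `a2` are each reached with probability $0.7 \cdot 0.5 = 0.35$ and leaf `b` with probability $0.3$, so the leaf probabilities sum to one. The sketch covers only the naive fork-by-fork scheme named in the caption, not the root-to-leaf path-weight formulation of Figure 1.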