
Rethink Efficiency Side of Neural Combinatorial Solver: An Offline and Self-Play Paradigm

Zhenxing Xu, Zeyuan Ma, Weidong Bao, Hui Yan, Yan Zheng, Ji Wang

TL;DR

Comparison results on TSP and CVRP show that ECO performs competitively with up-to-date baselines while offering a significant efficiency advantage in memory utilization and training throughput.

Abstract

We propose ECO, a versatile learning paradigm that enables efficient offline self-play for Neural Combinatorial Optimization (NCO). ECO addresses key limitations in the field through: 1) Paradigm Shift: Moving beyond inefficient online paradigms, we introduce a two-phase offline paradigm consisting of a supervised warm-up followed by iterative Direct Preference Optimization (DPO); 2) Architecture Shift: We design a Mamba-based architecture to further improve efficiency in the offline paradigm; and 3) Progressive Bootstrapping: To stabilize training, we employ a heuristic-based bootstrapping mechanism that ensures continuous policy improvement during training. Comparison results on TSP and CVRP show that ECO performs competitively with up-to-date baselines while offering a significant efficiency advantage in memory utilization and training throughput. We provide further in-depth analysis of the efficiency, throughput, and memory usage of ECO. Ablation studies support the rationale behind our design choices.
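To make the preference-learning phase concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the paper: the function name `dpo_loss`, the default `beta=0.1`, and the idea that the pair consists of a shorter (preferred) and a longer (dispreferred) tour for the same instance are assumptions made for illustration. The inputs are total log-probabilities of each tour under the current policy and under a frozen reference policy.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) solution pair.

    All names and the default beta are illustrative assumptions,
    not the paper's implementation.
    """
    # Margin: how much more the policy favors the winner over the loser,
    # relative to the frozen reference policy, scaled by beta.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Policy identical to the reference: zero margin, so the loss is log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy has shifted probability toward the preferred tour: lower loss.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In an iterative setting, one plausible loop would sample several solutions per instance from the current policy, rank them by objective value to form pairs, minimize this loss, and then refresh the reference policy. How ECO actually builds its pairs and schedules its iterations is described in the paper body, not in this summary.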

Paper Structure

This paper contains 31 sections, 17 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Convergence efficiency on TSP1000. The curves display the average optimality gap over 20 independent runs, with shaded regions indicating the standard deviation. All methods were trained on a single NVIDIA A800 GPU, with batch sizes tuned to maximize hardware utilization for each respective model.
  • Figure 2: The core design and workflow of ECO. Left: The Mamba-based encoder-decoder architecture proposed to facilitate offline NCO training. Right: The two-phase offline learning procedure, with an SFT warm-up for quick knowledge adaptation followed by iterative preference learning to further improve performance.
  • Figure 3: Scalability and Efficiency Analysis on NVIDIA A800 GPU (80GB RAM). We perform a stress test comparing the resource consumption of ECO (Mamba-based) against the Transformer baseline across problem sizes $N$ ranging from 100 to 10,000 (log scale). (a) Peak Memory Usage: The baseline exhibits characteristic quadratic growth ($O(N^2)$) due to the self-attention mechanism, triggering an Out-Of-Memory (OOM) error at $N=10,000$. In contrast, ECO maintains a near-linear memory profile ($O(N)$) owing to Mamba's fixed-size recurrent state, enabling scalable inference well within hardware limits. (b) Inference Throughput: ECO demonstrates orders-of-magnitude higher throughput (note the log scale on the y-axis) compared to the baseline. The performance gap widens significantly as $N$ increases, validating the computational efficiency advantage of our proposed SSM architecture for large-scale NCO tasks.
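The quadratic-versus-linear contrast in Figure 3(a) can be illustrated with a back-of-the-envelope calculation. The sketch below is not a measurement and does not model either architecture in the paper: it only counts the bytes of one materialized $N \times N$ attention score matrix against one $N \times d$ activation tensor. The width `d_model=128`, the 4-byte element size, and both function names are assumptions, and memory-efficient attention kernels that avoid materializing the score matrix would change the first figure.

```python
def attention_score_bytes(n, bytes_per_elem=4):
    """Bytes for one dense N x N attention score matrix (one head, one layer)."""
    return n * n * bytes_per_elem


def linear_activation_bytes(n, d_model=128, bytes_per_elem=4):
    """Bytes for one N x d_model activation tensor (d_model is hypothetical)."""
    return n * d_model * bytes_per_elem


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        quad_mb = attention_score_bytes(n) / 1e6
        lin_mb = linear_activation_bytes(n) / 1e6
        print(f"N={n:>6}: N^2 term {quad_mb:10.2f} MB, N*d term {lin_mb:8.3f} MB")
```

Under these assumptions, growing $N$ from 100 to 10,000 multiplies the quadratic term by 10,000 but the linear term by only 100. Multiplying the quadratic term by the number of heads, layers, and batch elements gives a rough sense of why a dense-attention model can exhaust GPU memory at large $N$.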