Pruning for efficient deterministic global optimization over trained ReLU neural networks

Giacomo Lastrucci, Tanuj Karia, Victor Schulte, Dominik Bongartz, Artur M. Schweidtmann

Abstract

Neural networks are increasingly used as surrogates in optimization problems to replace computationally expensive models. However, embedding ReLU neural networks in mathematical programs introduces significant computational challenges, particularly for deep and wide networks, due to both the formulation of the ReLU disjunction and the resulting large-scale optimization problem. This work investigates how pruning techniques can accelerate the solution of optimization problems with embedded neural networks, focusing on the mechanisms underlying the computational gains. We provide theoretical insights into how both unstructured (weight) and structured (node) pruning affect the ReLU big-M formulation, showing that pruning monotonically tightens preactivation bounds. We conduct comprehensive empirical studies across multiple network architectures using an illustrative test function and a realistic chemical process flowsheet optimization case study. Our results show that pruning achieves speedups of up to three to four orders of magnitude, with computational gains attributed to three key factors: (i) reduction in problem size, (ii) decrease in the number of integer variables, and (iii) tightening of big-M bounds. Weight pruning is particularly effective for deep, narrow networks, while node pruning performs better for shallow, wide or medium-sized networks. In the chemical engineering case study, pruning enabled convergence within seconds for problems that were otherwise intractable. We recommend adopting pruning as standard practice when developing neural network surrogates for optimization, especially for engineering applications requiring repeated optimization solves.
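
For readers unfamiliar with the formulation the abstract refers to, a common big-M encoding of a single ReLU neuron $z=\max(0,p)$ with preactivation bounds $L \le p \le U$ and $L<0<U$ is sketched below. This is the textbook form and is only assumed to match the paper's; the scalar notation and the binary variable $\sigma$ are illustrative choices, not taken from the source.

```latex
\begin{align}
  z &\ge 0,          & z &\ge p, \\
  z &\le U\,\sigma,  & z &\le p - L\,(1-\sigma), \\
  \sigma &\in \{0,1\}.
\end{align}
```

Relaxing $\sigma$ to $[0,1]$ and projecting it out gives the upper envelope $z \le U\,(p-L)/(U-L)$, so the relaxation gap shrinks as $L$ and $U$ tighten. A neuron whose bounds satisfy $L \ge 0$ or $U \le 0$ is fixed to its linear or zero branch and needs no binary variable, which is how tighter bounds can also reduce the number of integer variables.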

Paper Structure (33 sections, 2 theorems, 29 equations, 14 figures, 10 tables, 2 algorithms)

Key Result

Proposition 1

Consider a feedforward ReLU network with input box $\mathbf{x} \equiv \mathbf{z}^{(0)}\in[\mathbf{L}^{(0)},\mathbf{U}^{(0)}]$ and preactivation $\mathbf{p}$. Let $(\mathbf{L}^{(j)},\mathbf{U}^{(j)})$ denote the standard interval-arithmetic (IA) preactivation bounds of layer $j$, defined by the paper's IA lower- and upper-bound equations, and let $\mathbf{U}^{(j)}-\mathbf{L}^{(j)}$ be the corresponding interval widths. Let $\mathbf{W}'^{(j)}$ ...
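
The flavor of this result can be illustrated with a minimal sketch, assuming the usual IA rule for an affine layer $\mathbf{p} = \mathbf{W}\mathbf{z} + \mathbf{b}$ (positive weights pair with the matching bound, negative weights with the opposite one). The sketch covers only a single layer with fixed incoming bounds, where the width of neuron $i$ equals $\sum_k |w_{ik}|\,(z^U_k - z^L_k)$ and therefore cannot grow when weights are zeroed. The proposition's full statement and conditions are cut off in this excerpt and are not reproduced here. The function names, the magnitude threshold, and the numbers are all hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # one row per neuron of the current layer


def ia_preactivation_bounds(
    W: Matrix, b: Vector, z_lo: Vector, z_hi: Vector
) -> Tuple[Vector, Vector]:
    """Interval-arithmetic bounds on p = W z + b for z in [z_lo, z_hi]."""
    lower, upper = [], []
    for row, bias in zip(W, b):
        lo = hi = bias
        for w, zl, zu in zip(row, z_lo, z_hi):
            if w >= 0:
                lo += w * zl
                hi += w * zu
            else:  # a negative weight swaps the roles of the two bounds
                lo += w * zu
                hi += w * zl
        lower.append(lo)
        upper.append(hi)
    return lower, upper


def magnitude_prune(W: Matrix, threshold: float) -> Matrix:
    """Unstructured (weight) pruning: zero every weight with |w| < threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in W]


# Hypothetical layer with two neurons and three inputs.
W = [[1.5, -2.0, 0.1], [-0.05, 0.8, -1.2]]
b = [0.3, -0.4]
z_lo, z_hi = [-1.0, 0.0, -2.0], [1.0, 2.0, 2.0]

L, U = ia_preactivation_bounds(W, b, z_lo, z_hi)
L_pruned, U_pruned = ia_preactivation_bounds(magnitude_prune(W, 0.5), b, z_lo, z_hi)

widths = [u - l for l, u in zip(L, U)]
widths_pruned = [u - l for l, u in zip(L_pruned, U_pruned)]

# With the incoming box held fixed, pruning never widens a neuron's interval.
assert all(wp <= w + 1e-12 for w, wp in zip(widths, widths_pruned))
```

Note that the individual bounds may shift in either direction when a weight is removed; it is the interval width, and hence the spread of the big-M constants, that the sketch shows to be non-increasing in this single-layer setting.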

Figures (14)

  • Figure 1: Graphical abstract.
  • Figure 2: (a) Illustration of a dead neuron with pruned incoming weights and consequent flat activation and (b) an illustrative example for the dead neuron cleaning, where the biases of the neurons in the subsequent layer are adjusted to account for the constant activation of the dead neuron, with consequent safe removal of the output weights.
  • Figure 3: Bound tightening and resulting relaxation of ReLU.
  • Figure 4: Comparison of final regression accuracy measured relative to the unpruned baseline across different pruning levels.
  • Figure 5: Optimization time as a function of pruning level for different network architectures. The horizontal dashed line indicates the time limit of 7,200 seconds. Weight pruning (a) requires aggressive sparsity (99%) for convergence in deep networks, while node pruning (b) achieves convergence at lower sparsity levels (80%) across all architectures.
  • ...and 9 more figures
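
The dead-neuron cleaning described in the caption of Figure 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a neuron is dead when all its incoming weights are exactly zero, so its activation is the constant $\max(0, b_k)$, which is folded into the biases of the next layer before the neuron is removed. All function names and numbers are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # one row per neuron


def affine(W: Matrix, b: Vector, x: Vector) -> Vector:
    """Compute W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]


def relu_layer(W: Matrix, b: Vector, x: Vector) -> Vector:
    """Compute max(0, W x + b) elementwise."""
    return [max(0.0, p) for p in affine(W, b, x)]


def clean_dead_neurons(
    W: Matrix, b: Vector, W_next: Matrix, b_next: Vector
) -> Tuple[Matrix, Vector, Matrix, Vector]:
    """Remove neurons whose incoming weights are all zero.

    Such a neuron outputs the constant max(0, b_k), so its contribution
    W_next[i][k] * max(0, b_k) is moved into b_next[i].
    """
    dead = [k for k, row in enumerate(W) if all(w == 0.0 for w in row)]
    alive = [k for k in range(len(W)) if k not in dead]
    b_next_new = [
        bias + sum(row[k] * max(0.0, b[k]) for k in dead)
        for row, bias in zip(W_next, b_next)
    ]
    W_new = [W[k] for k in alive]
    b_new = [b[k] for k in alive]
    W_next_new = [[row[k] for k in alive] for row in W_next]
    return W_new, b_new, W_next_new, b_next_new


# Hypothetical hidden layer with three neurons; neurons 0 and 2 are dead.
W = [[0.0, 0.0], [1.0, -1.0], [0.0, 0.0]]
b = [0.7, 0.1, -0.3]
W_next = [[2.0, 1.0, 4.0], [-1.0, 0.5, 3.0]]
b_next = [0.0, 1.0]

W_c, b_c, W_next_c, b_next_c = clean_dead_neurons(W, b, W_next, b_next)

# The cleaned network reproduces the original next-layer preactivations.
x = [0.3, -1.2]
before = affine(W_next, b_next, relu_layer(W, b, x))
after = affine(W_next_c, b_next_c, relu_layer(W_c, b_c, x))
assert all(abs(u - v) < 1e-9 for u, v in zip(before, after))
```

In an optimization model, every neuron removed this way takes its continuous variable, its constraints, and any associated binary variable with it, while the function computed by the network stays the same.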

Theorems & Definitions (4)

  • Proposition 1: Monotonic tightening of big-M bounds under pruning
  • proof
  • Proposition 2: Strict tightening under mild conditions
  • proof