A GPU-accelerated Nonlinear Branch-and-Bound Framework for Sparse Linear Models

Xiang Meng, Ryan Lucas, Rahul Mazumder

TL;DR

This work tackles exact sparse linear regression with an $\ell_0-\ell_2$ penalty by formulating it as a mixed-integer program (MIP) with a perspective Big-M representation and solving it with a GPU-accelerated branch-and-bound (BnB) framework. The system combines an ADMM-based node relaxation, whose coordinate updates run fully in parallel, with a batched projected-gradient upper-bound solver, so that multiple BnB nodes can be explored in parallel on a single GPU. By batching node solves and carefully managing memory, the approach achieves substantial runtime reductions relative to CPU-based methods and commercial solvers, including MOSEK, on both synthetic and real high-dimensional datasets. The reported speedups hold across problem scales and parameter settings, indicating that hardware-aware algorithm design yields practical improvements for certifying global optima in sparse regression problems.
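
For orientation, the kind of formulation the summary refers to can be written as below. This is the standard perspective Big-M form from the sparse-regression literature, not a transcription of the paper's equations: the symbols $\lambda_0$, $\lambda_2$, $M$, $z$, and $s$ are assumed notation, and the paper's scaling of the loss and penalties may differ.

```latex
% l0-l2 penalized least squares (assumed notation):
\min_{\beta \in \mathbb{R}^p} \;
  \tfrac{1}{2}\,\|y - X\beta\|_2^2 + \lambda_0 \|\beta\|_0 + \lambda_2 \|\beta\|_2^2

% Perspective Big-M mixed-integer reformulation:
\min_{\beta,\, z,\, s} \;
  \tfrac{1}{2}\,\|y - X\beta\|_2^2
  + \lambda_0 \sum_{i=1}^{p} z_i + \lambda_2 \sum_{i=1}^{p} s_i
\quad \text{s.t.} \quad
  \beta_i^2 \le s_i z_i, \;\;
  -M z_i \le \beta_i \le M z_i, \;\;
  z_i \in \{0,1\}, \;\; s_i \ge 0, \qquad i = 1,\dots,p.
```

Here $z_i = 0$ forces $\beta_i = 0$, and relaxing $z_i \in \{0,1\}$ to $z_i \in [0,1]$ gives a convex problem of the kind a BnB node relaxation solves.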

Abstract

We study exact sparse linear regression with an $\ell_0-\ell_2$ penalty and develop a branch-and-bound (BnB) algorithm explicitly designed for GPU execution. Starting from a perspective reformulation, we derive an interval relaxation that can be solved by ADMM with closed-form, coordinate-wise updates. We structure these updates so that the main work at each BnB node reduces to batched matrix-vector operations with a shared data matrix, enabling fine-grained parallelism across coordinates and coarse-grained parallelism across many BnB nodes on a single GPU. Feasible solutions (upper bounds) are generated by a projected gradient method on the active support, implemented in a batched fashion so that many candidate supports are updated in parallel on the GPU. We discuss practical design choices such as memory layout, batching strategies, and load balancing across nodes that are crucial for obtaining good utilization on modern GPUs. On synthetic and real high-dimensional datasets, our GPU-based approach achieves clear runtime improvements over a CPU implementation of our method, an existing specialized BnB method, and commercial MIP solvers.
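
The batched upper-bound step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `batched_projected_gradient`, the 0/1 `masks` encoding of candidate supports, the Big-M bound `M`, the ridge weight `lam2`, and the fixed step size $1/L$ are all hypothetical choices, and NumPy on the CPU stands in for a GPU array library.

```python
import numpy as np


def batched_projected_gradient(X, y, masks, lam2, M, n_iters=500):
    """Minimize 0.5*||y - X b||^2 + lam2*||b||^2 for K supports at once.

    X: (n, p) data matrix shared by all candidates; y: (n,) response.
    masks: (p, K) array of 0/1 entries; column k marks candidate support k.
    Each column is constrained to b_i = 0 off its support and |b_i| <= M on it.
    (All names and the constraint set are assumptions made for illustration.)
    """
    p, K = masks.shape
    # Lipschitz constant of the gradient: largest eigenvalue of X^T X plus 2*lam2.
    lipschitz = np.linalg.norm(X, 2) ** 2 + 2.0 * lam2
    step = 1.0 / lipschitz
    B = np.zeros((p, K))
    for _ in range(n_iters):
        residual = X @ B - y[:, None]            # (n, K), one column per candidate
        grad = X.T @ residual + 2.0 * lam2 * B   # (p, K)
        # Projection: clip to the Big-M box, then zero out off-support entries.
        B = np.clip(B - step * grad, -M, M) * masks
    return B


# Small synthetic demo (sizes, seed, and supports are arbitrary).
rng = np.random.default_rng(0)
n, p, K = 50, 10, 3
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

masks = np.zeros((p, K))
masks[[0, 1, 2], 0] = 1.0
masks[[0, 1, 3], 1] = 1.0
masks[[2, 4, 5, 6], 2] = 1.0

B = batched_projected_gradient(X, y, masks, lam2=0.1, M=100.0)
```

Because every candidate shares `X`, each iteration costs two matrix-matrix products regardless of how the supports differ, which is the property that makes this kind of update a natural fit for batching.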

Paper Structure (28 sections, 1 theorem, 47 equations, 7 figures, 5 tables, 3 algorithms)

Key Result

Proposition 1

We define the function $h:\mathbb{R}_{\ge0}\to\mathbb{R}$ as […]. A dual of problem (eq:ADMM1) is then given by […], where […]. Let $(b^*,\beta^*)$ be an optimal solution to problem (eq:ADMM1); then $r^*=y-Xb^*$ is the optimal dual solution of problem (eq:dual). Moreover, strong duality holds for (eq:ADMM1).

Figures (7)

  • Figure 1: High-level view of our GPU-accelerated BnB algorithm. A batch $\mathcal{B}$ of $K$ active nodes from the CPU-side BnB tree is mapped to batched node subproblem algorithms on the GPU. Each node $u \in \mathcal{B}$ is solved in parallel and the resulting lower ($\text{LB}_u$) and upper bounds ($\text{UB}_u$) are fed back to guide branching and pruning decisions on CPU.
  • Figure 2: Illustration of the Alternating Direction Method of Multipliers (ADMM) updates. Left: ADMM allows for operator splitting by separating the optimization problem into sequential updates of primal variables $b$, auxiliary variables $\beta$, and dual variables $v$. Each variable is updated by minimizing the augmented Lagrangian $L_\rho$ with respect to that variable while keeping the others fixed. Right: These updates can be decomposed across the coordinates of all vectors for fast parallel computation. Each component $b_i^{(t+1)}$, $\beta_i^{(t+1)}$, $v_i^{(t+1)}$ for $i = 1, \dots, p$ is updated independently across coordinates, which enables fast parallel updates on GPUs.
  • Figure 3: SIMD model for parallel lower bound algorithm
  • Figure 4: Average node relaxation solving time (in seconds) of our approach versus L0BnB on 10 random seeds.
  • Figure 5: Number of nodes solved within one hour with CPUBnB/GPUBnB (without node parallelism) across different computing platforms. Results are shown for 10 random seeds on synthetic datasets with $n=3000$ samples, $p=30000$ features, and $\mathrm{SNR}=0.1$.
  • ...and 2 more figures
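
The three-step update pattern in the Figure 2 caption ($b$, then $\beta$, then $v$), applied to a batch of nodes as in Figure 1, can be illustrated with a small NumPy sketch. It is not the paper's relaxation: it uses textbook scaled-form ADMM on a simplified box-constrained ridge problem, and its $b$-update is a shared linear solve rather than the coordinate-wise update the caption describes. The function name `batched_admm_box_ridge`, the per-node bounds `lower` and `upper`, and the choice `rho = n` are assumptions made for illustration.

```python
import numpy as np


def batched_admm_box_ridge(X, y, lower, upper, lam2, rho, n_iters=500):
    """Scaled-form ADMM for K box-constrained ridge problems sharing X and y.

    Node k solves: min 0.5*||y - X b||^2 + lam2*||beta||^2
                   s.t. b = beta, lower[:, k] <= beta <= upper[:, k].
    (A simplified stand-in for a BnB node relaxation; names are hypothetical.)
    """
    p, K = lower.shape
    # Shared across all nodes; formed explicitly only because p is small here.
    A_inv = np.linalg.inv(X.T @ X + rho * np.eye(p))
    Xty = (X.T @ y)[:, None]
    Beta = np.clip(np.zeros((p, K)), lower, upper)
    V = np.zeros((p, K))
    for _ in range(n_iters):
        # b-update: minimize the augmented Lagrangian in b (one batched product).
        B = A_inv @ (Xty + rho * (Beta - V))
        # beta-update: separable 1-D quadratics, so clipping the free minimizer is exact.
        Beta = np.clip(rho * (B + V) / (2.0 * lam2 + rho), lower, upper)
        # v-update: scaled dual ascent on the constraint b = beta.
        V = V + B - Beta
    residual = y[:, None] - X @ Beta
    values = 0.5 * np.sum(residual**2, axis=0) + lam2 * np.sum(Beta**2, axis=0)
    return Beta, values


# Small synthetic demo with three nodes (sizes, seed, and bounds are arbitrary).
rng = np.random.default_rng(0)
n, p, K = 50, 10, 3
X = rng.standard_normal((n, p))
y = X[:, :3].sum(axis=1) + 0.1 * rng.standard_normal(n)

big = 100.0
lower = np.full((p, K), -big)
upper = np.full((p, K), big)
lower[3:, 1] = 0.0   # node 1: coordinates 3..9 branched to zero
upper[3:, 1] = 0.0
lower[:, 2] = -0.5   # node 2: tight box on every coordinate
upper[:, 2] = 0.5

Beta, values = batched_admm_box_ridge(X, y, lower, upper, lam2=0.1, rho=float(n))
```

The returned `values` are primal objective values at the final iterates, so they only approximate each node's relaxation value. A certified lower bound for pruning would need a dual solution, which this sketch does not compute.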

Theorems & Definitions (1)

  • Proposition 1