Fast Projection onto the Capped Simplex with Applications to Sparse Regression in Bioinformatics

Andersen Ang, Jianzhu Ma, Nianjun Liu, Kun Huang, Yijie Wang

TL;DR

The paper tackles projecting a vector onto the $k$-capped simplex, a constraint set combining a box with a linear cap. It reformulates the problem as a scalar dual minimization in a Lagrange multiplier $\gamma$ and solves it efficiently using Newton's method, with a closed-form primal update $\mathbf{x}^*(\gamma) = \min\{\mathbf{1}, [\mathbf{y}-\gamma\mathbf{1}]_+\}$. The authors prove that the dual objective $\omega(\gamma)$ is convex and $n$-smooth, enabling fast convergence and yielding substantial runtime gains over sorting-based methods, especially for very large $n$. They demonstrate the method's practicality by accelerating sparse-regression procedures on GWAS-scale bioinformatics data, achieving 3–6x speedups over state-of-the-art approaches and enabling large-scale analyses that were previously challenging.
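The scalar Newton iteration described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `project_capped_simplex`, the equality form of the constraint ($\sum_i x_i = k$), the sign convention for $\omega^\prime$, and the bisection safeguard are all assumptions made for the example.

```python
import numpy as np

def project_capped_simplex(y, k, gamma0=None, tol=1e-10, max_iter=100):
    """Sketch: project y onto {x : 0 <= x <= 1, sum(x) = k} (assumed set)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    if not 0 <= k <= n:
        raise ValueError("k must lie in [0, n]")

    # sum(x*(gamma)) is non-increasing in gamma, equals n at lo and 0 at hi,
    # so a root of k - sum(x*(gamma)) lies in [lo, hi].
    lo, hi = y.min() - 1.0, y.max()
    gamma = 0.5 * (lo + hi) if gamma0 is None else float(gamma0)

    for _ in range(max_iter):
        shifted = y - gamma
        x = np.clip(shifted, 0.0, 1.0)       # closed-form primal update
        grad = k - x.sum()                   # assumed form of omega'(gamma)
        if abs(grad) <= tol:
            break
        # Shrink the bracket around the root (safeguard, not from the paper).
        if grad > 0:
            hi = min(hi, gamma)
        else:
            lo = max(lo, gamma)
        # Assumed form of omega''(gamma): number of strictly interior entries.
        hess = np.count_nonzero((shifted > 0.0) & (shifted < 1.0))
        new_gamma = gamma - grad / hess if hess > 0 else np.nan
        # Fall back to bisection if the Newton step is undefined or leaves the bracket.
        if not (lo < new_gamma < hi):
            new_gamma = 0.5 * (lo + hi)
        gamma = new_gamma

    return np.clip(y - gamma, 0.0, 1.0), gamma


# Toy example from Figure 1: y = [0.1, 1.5, -1], k = 1.5, gamma_0 = -1.1
x, gamma = project_capped_simplex([0.1, 1.5, -1.0], k=1.5, gamma0=-1.1)
print(x, gamma)
```

On the toy input from Figure 1, this sketch reaches $\gamma \approx -0.4$ and $\mathbf{x} \approx [0.5, 1, 0]$ after two Newton updates. Each iteration costs one pass over $\mathbf{y}$ and needs no sorting, which is where the roughly $O(n)$ behaviour would come from when the iteration count stays small.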

Abstract

We consider the problem of projecting a vector onto the so-called k-capped simplex, which is a hyper-cube cut by a hyperplane. For an n-dimensional input vector with bounded elements, we found that a simple algorithm based on Newton's method is able to solve the projection problem to high precision with a complexity roughly about O(n), which has a much lower computational cost compared with the existing sorting-based methods proposed in the literature. We provide a theory for partial explanation and justification of the method. We demonstrate that the proposed algorithm can produce a solution of the projection problem with high precision on large scale datasets, and the algorithm is able to significantly outperform the state-of-the-art methods in terms of runtime (about 6-8 times faster than a commercial software with respect to CPU time for input vector with 1 million variables or more). We further illustrate the effectiveness of the proposed algorithm on solving sparse regression in a bioinformatics problem. Empirical results on the GWAS dataset (with 1,500,000 single-nucleotide polymorphisms) show that, when using the proposed method to accelerate the Projected Quasi-Newton (PQN) method, the accelerated PQN algorithm is able to handle huge-scale regression problem and it is more efficient (about 3-6 times faster) than the current state-of-the-art methods.

Paper Structure

This paper contains 35 sections, 4 theorems, 18 equations, 5 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

The function $\omega(\gamma)$ is convex and twice differentiable, with explicit expressions for $\omega^\prime$ and $\omega^{\prime\prime}$ given in terms of an indicator function $I$, where $I_A = 1$ if $A$ is true and $I_A = 0$ otherwise.
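
The explicit expressions are not reproduced on this page. As a hedged sketch of what they plausibly look like, assume the constraint set is $\{\mathbf{x} : \mathbf{0} \le \mathbf{x} \le \mathbf{1},\ \mathbf{1}^\top \mathbf{x} = k\}$ and that $\omega$ is the negated Lagrangian dual function (both are assumptions, not taken from the source). Then, using the primal update $\mathbf{x}^*(\gamma) = \min\{\mathbf{1}, [\mathbf{y}-\gamma\mathbf{1}]_+\}$:

$$
\begin{aligned}
L(\mathbf{x}, \gamma) &= \tfrac{1}{2}\|\mathbf{x} - \mathbf{y}\|_2^2 + \gamma\,(\mathbf{1}^\top \mathbf{x} - k), \qquad
\omega(\gamma) = -\min_{\mathbf{0} \le \mathbf{x} \le \mathbf{1}} L(\mathbf{x}, \gamma) = -L(\mathbf{x}^*(\gamma), \gamma), \\
\omega^\prime(\gamma) &= k - \sum_{i=1}^{n} \min\{1, [y_i - \gamma]_+\}, \\
\omega^{\prime\prime}(\gamma) &= \sum_{i=1}^{n} I_{\{0 < y_i - \gamma < 1\}} \le n .
\end{aligned}
$$

Under these assumptions $\omega^\prime$ is non-decreasing and piecewise linear, which gives convexity, and the bound $\omega^{\prime\prime} \le n$ is consistent with the $n$-smoothness stated in the TL;DR. The second derivative in this sketch holds away from the breakpoints $\gamma = y_i$ and $\gamma = y_i - 1$.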

Figures (5)

  • Figure 1: Illustration of solving the projection problem on a toy example with $\mathbf{y} = [0.1, 1.5, -1] \in \mathbb{R}^3, k=1.5$ and $\gamma_0 = -1.1$. The line $l_2$ intersects the piecewise-linear function $l_1$ at $\gamma^*$, which is the root of $l_2$. The idea of the sorting-based methods is to find this root by sorting the 3 elements of $\mathbf{y}$ and trying each of the 3 linear segments of $l_1$. The proposed method (Algorithm 1) solves the problem in 2 iterations, hence the speed-up. In the middle we plot $\omega^{\prime\prime}$ and the iteration of $\gamma_t$. The right-most plots show the iteration of $\omega^\prime$ and $\omega^{\prime\prime}$ (see Theorem 1 for their explicit expressions). Note that $l_1, l_2$ may not intersect for some $(\mathbf{y},k)$; in this case $l_2$ has no root and the sorting-based methods do not work. However, the proposed algorithm can still produce an approximate solution.
  • Figure 2: Comparison between the proposed projection method and the Gurobi projection for $\alpha \in \{1,10,100,1000 \}$. The thick curves are the median results over 100 experiments on 100 datasets. The figure shows the superior performance of the proposed method in terms of runtime.
  • Figure 3: Comparison between different methods. The results are averaged over 100 datasets, and all error bars show mean$\pm$std. From left to right, the sub-figures are: (a) the Acc values of the different methods when $n=10^3$, $k=10$, $p=0.2$, and SNR$=6$; (b) the computation time of the algorithms in (a); (c) the Acc values of the different methods when $p$ is changed to $0.7$; and (d) the computation time of the algorithms in (c).
  • Figure 4: Convergence comparison between PQN with the proposed projection algorithm and PQN with Gurobi projection. From left to right, the sub-figures are: (a) comparison on chromosome 20; (b) comparison on chromosome 5; and (c) comparison on chromosome 2.
  • Figure 5: Convergence comparison (plotted with error bars) between PQN + our projection, PQN + Gurobi, and the SS method. (a) Convergence comparison on one dataset. (b) Convergence comparison on 10 different datasets.

Theorems & Definitions (8)

  • Remark 1
  • Theorem 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof: Proof of Theorem 1
  • Corollary 2
  • proof