Sparse-Input Neural Network using Group Concave Regularization

Bin Luo, Susan Halabi

TL;DR

The paper tackles high-dimensional predictive modeling by jointly selecting input features and estimating nonlinear functions via sparse-input neural networks with group concave regularization. It introduces a ridge-stabilized objective that applies a concave group penalty to each input node's outgoing weights, coupled with backward path-wise optimization to produce stable solution paths. The authors establish non-asymptotic estimation and prediction guarantees and prove an oracle property under standard high-dimensional conditions. Extensive simulations and real-data applications (continuous, binary, and time-to-event outcomes) show improved feature selection and competitive predictive performance. The work is relevant to interpretable nonlinear modeling in high-dimensional data analysis (HDDA): it offers a computationally efficient path-wise training strategy and theoretical support for variable selection consistency in neural networks.

Abstract

Simultaneous feature selection and non-linear function estimation is challenging in modeling, especially in high-dimensional settings where the number of variables exceeds the available sample size. In this article, we investigate the problem of feature selection in neural networks. Although the group least absolute shrinkage and selection operator (LASSO) has been utilized to select variables for learning with neural networks, it tends to select unimportant variables into the model to compensate for its over-shrinkage. To overcome this limitation, we propose a framework of sparse-input neural networks using group concave regularization for feature selection in both low-dimensional and high-dimensional settings. The main idea is to apply a proper concave penalty to the $l_2$ norm of weights from all outgoing connections of each input node, and thus obtain a neural net that only uses a small subset of the original variables. In addition, we develop an effective algorithm based on backward path-wise optimization to yield stable solution paths, in order to tackle the challenge of complex optimization landscapes. We provide a rigorous theoretical analysis of the proposed framework, establishing finite-sample guarantees for both variable selection consistency and prediction accuracy. These results are supported by extensive simulation studies and real data applications, which demonstrate the finite-sample performance of the estimator in feature selection and prediction across continuous, binary, and time-to-event outcomes.
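
The penalty described in the abstract can be illustrated with a minimal sketch. It assumes the first-layer weight matrix is stored with one row per input node and uses the minimax concave penalty (MCP) as one example of a concave penalty. The function names, the array layout, the concavity parameter `a = 3.0`, and the tolerance `tol` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mcp(t, lam, a=3.0):
    """Minimax concave penalty at t >= 0 (elementwise); a > 1 is an assumed default."""
    t = np.asarray(t, dtype=float)
    # Concave quadratic up to a*lam, then constant at a*lam^2/2.
    return np.where(t <= a * lam, lam * t - t**2 / (2.0 * a), 0.5 * a * lam**2)

def input_group_norms(W0):
    """l2 norm of the outgoing weights of each input node.

    Assumed layout: W0 has shape (n_inputs, n_hidden), row j = weights leaving input j.
    """
    return np.linalg.norm(W0, axis=1)

def group_mcp_penalty(W0, lam, a=3.0):
    """Sum of the MCP applied to each input node's group norm."""
    return float(np.sum(mcp(input_group_norms(W0), lam, a)))

def selected_inputs(W0, tol=1e-8):
    """Indices of inputs whose whole outgoing weight group is non-zero."""
    return np.flatnonzero(input_group_norms(W0) > tol)

W0 = np.array([[3.0, 4.0],   # group norm 5: beyond a*lam, penalty is capped
               [0.6, 0.8],   # group norm 1: inside the concave region
               [0.0, 0.0]])  # group norm 0: this input is dropped from the network
penalty = group_mcp_penalty(W0, lam=1.0)
kept = selected_inputs(W0)
```

Because the penalty acts on the whole group norm, an input is removed only when all of its outgoing weights are zero, and the cap on large norms is what limits the over-shrinkage attributed to the group LASSO.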

Paper Structure

This paper contains 35 sections, 8 theorems, 90 equations, 9 figures, 3 tables.

Key Result

Theorem 4.1

For any $\tilde{\lambda} > 0$ and $T \ge 1$, let [...]. Suppose Conditions cond:strong_convexity through as:penalty_full hold. Let $\hat{{\mathbf w}}$ be a local minimizer of equation eq:constrained_problem such that $\max_{j\in S^c}\|\hat{{\mathbf W}}_{0,j}\|_2\le \delta\lambda$. Then over the set $\mathcal{T}_{\tilde{\lambda},T}$, for any tuning parameter $\la where

Figures (9)

  • Figure 1: Solution paths of the $l_2$ norm of the weight vector associated with each input node, $\|{\mathbf W}_{0j}\|_2$. Each line shows the path of a single variable's group weight norm; green lines correspond to the four true informative variables and gray dashed lines to the sixteen nuisance variables. The bottom panels illustrate the smoother solution paths generated by our backward path-wise strategy, where weight norms change predictably without the erratic jumps seen in the non-path-wise (top left) and forward path-wise (top middle, top right) approaches. Note that the x-axis, $\log(\lambda)$, differs between plots because the absolute scale of the regularization parameter is not directly comparable across penalty types. The individual panels are as follows. Top left: non-path-wise optimization using GMCP; all the neural network weights are initialized by drawing from $N(0, 0.1)$ for each $\lambda$. Top middle: forward path-wise optimization using GMCP; it starts from the null model and computes the solution with decreasing $\lambda$, and random initialization is used before the selection of the first set of variables. Top right: forward path-wise optimization using GLASSO. Bottom left: backward path-wise optimization using GSCAD. Bottom middle: backward path-wise optimization using GMCP. Bottom right: backward path-wise optimization using GLASSO.
  • Figure 2: Top row: simulation results for the model of XOR-type signals. Bottom row: simulation results for the model of hierarchical signals. The $R^2$ scores, false positive rate (FPR), and false negative rate (FNR) are presented in the left, middle, and right columns, respectively. The central lines are the means, while the shaded areas represent standard deviations.
  • Figure 3: Top row: $R^2$ score of the proposed methods for the regression model outlined in the regression example (ex:regression). Middle row: Accuracy of the proposed methods for the classification model outlined in the classification example (ex:classification). Bottom row: C-Index of the proposed methods for the survival model outlined in the survival example (ex:survival). The dashed lines represent the median score of the Oracle-NN, used as a benchmark for comparison.
  • Figure 4: Sensitivity analysis of GSCADNet to hyperparameters. Left: $\gamma$, the scaling factor for the thresholding operator. Middle: Learning rate (LR) in the Adam optimizer. Right: Network structure. The network structure $[l_1, l_2, \ldots, l_k]$ represents the number of nodes in each hidden layer. The fixed choices used in our numerical study ($\gamma = 1$, LR $= 0.001$, and network structure $[10, 5]$) are marked on the plots with an "x" symbol for each metric ($R^2$, FPR, and FNR).
  • Figure 5: Left: Boxplots of tAUC on the testing set over 100 random splits. Middle: The number of selected variables for GLASSONet, GMCPNet, and GSCADNet. Right: Variables selected by GSCADNet with selection proportion $\ge 10\%$ over 100 random splits.
  • ...and 4 more figures
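
The path-wise strategies compared in Figure 1 can be sketched as a warm-started loop over $\lambda$. The caption says the forward path starts from the null model and decreases $\lambda$; this sketch assumes the backward path is the reverse, starting from a dense fit at the smallest $\lambda$ and increasing it. The names `backward_path` and `fit_step` are hypothetical, and `fit_step` here is a stand-in (a single group soft-thresholding step, which is a group-LASSO-type operation) rather than network training with a concave penalty.

```python
import numpy as np

def group_soft_threshold(W, lam):
    """Shrink each row's l2 norm by lam; rows with norm below lam become exactly zero."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return W * scale

def backward_path(W_dense, lambdas, fit_step):
    """Warm-started path from the smallest to the largest lambda.

    fit_step(W, lam) is a hypothetical hook: in a real pipeline it would run
    the optimizer at penalty level lam, starting from the weights W.
    """
    W = W_dense.copy()
    path = []
    for lam in sorted(lambdas):        # dense end first, sparse end last
        W = fit_step(W, lam)           # warm start from the previous solution
        n_active = int(np.count_nonzero(np.linalg.norm(W, axis=1)))
        path.append((lam, n_active, W.copy()))
    return path

# Toy first-layer weights: one row per input node (assumed layout).
W_dense = np.array([[3.0, 0.0], [0.0, 2.0], [0.5, 0.0], [0.1, 0.0]])
path = backward_path(W_dense, [1.0, 0.05, 0.3], group_soft_threshold)
active_counts = [n for _, n, _ in path]
visited = [lam for lam, _, _ in path]
```

Each step starts from the previous solution instead of a fresh random initialization, which is the design choice the caption credits for the smoother bottom-row paths.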

Theorems & Definitions (19)

  • Remark 4.1
  • Theorem 4.1
  • Remark 4.2
  • Theorem 4.2: Probabilistic Guarantee for Event $\mathcal{T}_{\tilde{\lambda},T}$
  • Theorem 4.3: Existence of a Stationary Point with Oracle Properties
  • Example 5.1
  • Example 5.2
  • Example 5.3
  • Lemma B.1: Quadratic lower bound
  • proof
  • ...and 9 more