Efficient Group Lasso Regularized Rank Regression with Data-Driven Parameter Determination
Meixia Lin, Meijiao Shi, Yunhai Xiao, Qian Zhang
TL;DR
The paper addresses robust high-dimensional regression by replacing the squared loss with a Wilcoxon rank loss, $L(X\beta - y)$, and enforcing group sparsity through a group lasso penalty, $\lambda\Psi(\beta)$. The authors introduce a data-driven, simulation-based rule for selecting $\lambda$, prove a finite-sample error bound for the estimator, and develop a proximal augmented Lagrangian method (PALM) whose subproblems are solved by a semismooth Newton method. In the reported experiments, the approach is robust to heavy-tailed noise and scales to large datasets, outperforming cross-validated group lasso baselines in both accuracy and computation time. Together, these results give a theoretically grounded, practical method for robust high-dimensional regression with structured sparsity.
Abstract
High-dimensional regression often suffers from heavy-tailed noise and outliers, which can severely undermine the reliability of least-squares-based methods. To improve robustness, we adopt a non-smooth, Wilcoxon-score-based rank objective and incorporate structured group sparsity regularization, a natural generalization of the lasso, yielding a group lasso regularized rank regression method. By extending the tuning-free parameter selection scheme originally developed for the lasso, we introduce a data-driven, simulation-based tuning rule and establish a finite-sample error bound for the resulting estimator. On the computational side, we develop a proximal augmented Lagrangian method for solving the associated optimization problem. This method eliminates the singularity issues encountered in existing methods and thereby enables efficient semismooth Newton updates for the subproblems. Extensive numerical experiments demonstrate the robustness and effectiveness of the proposed estimator relative to alternatives and showcase the scalability of the algorithm in both simulated and real-data settings.
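To make the objective concrete, the following Python sketch evaluates a group lasso regularized rank objective of the form $L(X\beta - y) + \lambda\Psi(\beta)$ and the proximal map of the group lasso penalty (block soft-thresholding). It is an illustration under stated assumptions, not the authors' implementation: the normalization $1/(n(n-1))$ of the Wilcoxon loss, the default unit group weights, and all function names (`wilcoxon_rank_loss`, `group_lasso_penalty`, `rank_objective`, `prox_group_lasso`) are hypothetical choices made here, and the paper's PALM and semismooth Newton solver are not reproduced.

```python
import numpy as np

def wilcoxon_rank_loss(residual):
    """Average absolute pairwise difference of residuals.

    Assumed normalization: sum over ordered pairs i != j, divided by n(n-1).
    """
    r = np.asarray(residual, dtype=float)
    n = r.size
    diffs = np.abs(r[:, None] - r[None, :])  # n x n matrix of |r_i - r_j|
    return diffs.sum() / (n * (n - 1))

def group_lasso_penalty(beta, groups, weights=None):
    """Psi(beta) = sum_g w_g * ||beta_g||_2 (unit weights assumed by default)."""
    beta = np.asarray(beta, dtype=float)
    if weights is None:
        weights = np.ones(len(groups))
    return sum(w * np.linalg.norm(beta[g]) for g, w in zip(groups, weights))

def rank_objective(beta, X, y, lam, groups, weights=None):
    """Wilcoxon rank loss of the residual plus lam times the group penalty."""
    return wilcoxon_rank_loss(X @ beta - y) + lam * group_lasso_penalty(
        beta, groups, weights
    )

def prox_group_lasso(v, groups, tau, weights=None):
    """Block soft-thresholding: proximal map of tau * Psi at the point v."""
    v = np.asarray(v, dtype=float)
    if weights is None:
        weights = np.ones(len(groups))
    out = np.zeros_like(v)
    for g, w in zip(groups, weights):
        norm = np.linalg.norm(v[g])
        if norm > tau * w:  # otherwise the whole group is set to zero
            out[g] = (1.0 - tau * w / norm) * v[g]
    return out
```

Because the loss depends only on pairwise differences of residuals, adding a constant to every residual leaves it unchanged, which is why an intercept cannot be identified from this loss alone. The pairwise matrix costs $O(n^2)$ memory, so this form is only suitable for small illustrative problems.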
