Data-driven configuration tuning of glmnet to balance accuracy and computation time
Shuhei Muroya, Kei Hirose
TL;DR
The paper tackles the problem that glmnet's default configuration can yield inaccurate lasso solutions on high-dimensional, correlated data, because the default convergence settings and lambda grid are suboptimal for such data. It proposes a data-driven framework that generates large-scale synthetic datasets, trains a neural predictor (glmnet-MLP) to map data characteristics and configuration to Solution Path Error ($\mathrm{SPE}$) and runtime, and then uses Pareto front optimization to select a glmnet configuration within a user-specified time budget $T_{\text{hope}}$. Key contributions include a large summary dataset (over 810k samples), a Bayesian-optimized neural architecture for predicting $\mathrm{SPE}$ and computation time, and a Pareto-front-based tuning mechanism implemented in the glmnetconf R package, which can also choose between glmnet and LARS. Empirically, the method achieves accuracy close to the exact LARS solution while significantly reducing computation time, as demonstrated in both synthetic simulations and compressed sensing tasks. Future work would extend the training coverage and apply the framework to other GLMs. The approach provides a transparent, fast, and scalable way to balance accuracy and efficiency in regularized regression.
Abstract
glmnet is a widely adopted R package for lasso estimation due to its computational efficiency. Despite its popularity, glmnet sometimes yields solutions that are substantially different from the true ones because the default configuration of the algorithm is inappropriate. The accuracy of the obtained solutions can be improved by appropriately tuning the configuration. However, improving accuracy typically increases computation time, resulting in a trade-off between accuracy and computational efficiency. Therefore, it is essential to establish a systematic approach to determining an appropriate configuration. To address this need, we propose a unified data-driven framework designed to optimize the configuration by balancing this trade-off. We generate large-scale simulated datasets and apply glmnet under various configurations to measure accuracy and computation time. Based on these results, we construct neural networks that predict accuracy and computation time from data characteristics and configuration. Given a new dataset, our framework uses the neural networks to explore the configuration space and derive a Pareto front that represents the trade-off between accuracy and computational cost. This front allows us to automatically identify the configuration that maximizes accuracy under a user-specified time constraint. The proposed method is implemented in the R package 'glmnetconf', available at https://github.com/Shuhei-Muroya/glmnetconf.
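The final selection step described above, picking the most accurate configuration on the Pareto front that fits a time budget, can be illustrated with a minimal Python sketch. This is not the glmnetconf implementation: the `Candidate` type, the configuration labels, and the predicted error and runtime values are all hypothetical placeholders standing in for the output of the neural predictors.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str          # hypothetical configuration label
    pred_error: float  # predicted solution error (lower is better)
    pred_time: float   # predicted runtime in seconds


def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Return the non-dominated candidates, sorted by predicted time."""
    front = []
    best_error = float("inf")
    # Scan from fastest to slowest; a candidate is kept only if it is
    # strictly more accurate than every faster candidate seen so far.
    for c in sorted(cands, key=lambda c: (c.pred_time, c.pred_error)):
        if c.pred_error < best_error:
            front.append(c)
            best_error = c.pred_error
    return front


def select_under_budget(cands: list[Candidate],
                        time_budget: float) -> Optional[Candidate]:
    """Pick the most accurate Pareto-optimal candidate within the budget."""
    feasible = [c for c in pareto_front(cands) if c.pred_time <= time_budget]
    if not feasible:
        return None  # no configuration is predicted to finish in time
    return min(feasible, key=lambda c: c.pred_error)


# Made-up predictions for illustration only (not results from the paper).
candidates = [
    Candidate("default", 0.30, 1.0),
    Candidate("tight_thresh", 0.10, 4.0),
    Candidate("dense_grid", 0.20, 5.0),  # dominated by "tight_thresh"
    Candidate("tight_and_dense", 0.02, 12.0),
]

front = pareto_front(candidates)
choice = select_under_budget(candidates, time_budget=6.0)
```

With a budget of 6 seconds, the sketch selects the hypothetical `tight_thresh` entry: it is the lowest-error point on the front whose predicted runtime fits. Restricting the search to the front first is a design convenience, since a dominated configuration can never be the best choice for any budget.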
