biniLasso: Automated cut-point detection via sparse cumulative binarization

Abdollah Safari, Hamed Halisaz, Peter Loewen

TL;DR

This work tackles data-driven cut-point detection in high-dimensional survival analysis by introducing biniLasso, which uses cumulative binarization to model continuous predictors within a Cox framework, and its sparse variant miniLasso that enforces sparsity via uniLasso with sign-consistency. The methods enable detection of multiple cut-points per feature while delivering substantial computational gains (2–8× faster than the state-of-the-art binacox) and competitive predictive accuracy, as shown in extensive simulations. In three TCGA cancer datasets, biniLasso and miniLasso identify meaningful cut-points, offer stable risk stratification, and often surpass binacox in interpretability and performance (AIC, IBS, C-index) when evaluated against CGAM and continuous models. The approach generalizes to other GLMs, providing a practical, interpretable toolkit for high-dimensional prognostic modeling and risk stratification.

Abstract

We present biniLasso and its sparse variant (sparse biniLasso), novel methods for prognostic analysis of high-dimensional survival data that enable detection of multiple cut-points per feature. Our approach leverages the Cox proportional hazards model with two key innovations: (1) a cumulative binarization scheme with $L_1$-penalized coefficients operating on context-dependent cut-point candidates, and (2) for sparse biniLasso, additional uniLasso regularization to enforce sparsity while preserving univariate coefficient patterns. These innovations yield substantially improved interpretability, computational efficiency (4-11x faster than existing approaches), and prediction performance. Through extensive simulations, we demonstrate superior performance in cut-point detection, particularly in high-dimensional settings. Application to three genomic cancer datasets from TCGA confirms the methods' practical utility, with both variants showing enhanced risk prediction accuracy compared to conventional techniques.
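
To make the cumulative binarization idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a fixed grid of candidate cut-points and encodes a continuous feature as indicator columns $z_k = 1\{x > c_k\}$. The function name `cumulative_binarize`, the strict inequality, the candidate grid, and the coefficient vector `beta` are all illustrative assumptions; in practice the coefficients would come from an $L_1$-penalized Cox fit, which is not shown here.

```python
import numpy as np

def cumulative_binarize(x, cuts):
    """Encode x as an (n, K) 0/1 matrix; column k is 1 where x > cuts[k].

    Hypothetical helper: the strict inequality and the sorted candidate
    grid are assumptions made for illustration.
    """
    x = np.asarray(x, dtype=float)
    cuts = np.sort(np.asarray(cuts, dtype=float))
    return (x[:, None] > cuts[None, :]).astype(int)

# Toy feature values and candidate cut-points (made up for illustration).
x = np.array([0.5, 1.5, 2.5, 3.5])
cuts = np.array([1.0, 2.0, 3.0])
Z = cumulative_binarize(x, cuts)

# Hypothetical sparse coefficients, standing in for an L1-penalized fit.
beta = np.array([0.8, 0.0, -0.3])

# Under this encoding, each nonzero coefficient marks a jump in the
# fitted step function, so the retained candidates are the cut-points.
detected = cuts[beta != 0]

# Piecewise-constant contribution of the feature to the linear predictor.
step = Z @ beta
```

Because each column switches on once and stays on, a coefficient here is the size of the jump at its candidate, so a zero coefficient simply removes that candidate from the fitted step function.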

Paper Structure

This paper contains 14 sections, 4 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: (A) Original continuous predictor $X_1$ used as input. (B) True threshold relationship in Scenarios 1, 2, and 4, with two true cut‑points. (C) Gradual "cut‑region" relationship in Scenario 3.
  • Figure 2: Results for benchmark Scenarios 1 and 2. For Scenario 1: average computation time (A) and bias in estimated cut‑points (B) across values of $n$. For Scenario 2: average computation time (C) and bias (D) across values of $P$. Results are shown for biniLasso (blue), miniLasso (green), and binacox (purple). Vertical bars represent ± 1 SD over 5000 simulations.
  • Figure 3: Results for Scenario 3 (no true cut-points). Average computation time (A), AIC (B), IBS (C), and number of estimated cut-points for $X_1$ (D) for biniLasso (blue), miniLasso (green), binacox (purple), and the true continuous model (red). Results are shown across values of $n$; vertical bars represent ± 1 SD over 5000 simulations.
  • Figure 4: Detected cut-points for the 8 genes with the highest number of cut-points across all three methods, shown against the log relative hazard from a CGAM smooth fit in the GBM data.