Optimal Risk Scores for Continuous Predictors

Cristina Molero-Río, Claudia D'Ambrosio

TL;DR

This work tackles learning interpretable risk scores when predictors are continuous, by casting the problem as a mixed-integer nonlinear optimization that minimizes logistic loss under sparsity constraints. It extends prior binary-input risk-score formulations to continuous predictors by learning thresholds and introducing linearizations (bilinear and Fortet-based) to maintain tractability. Computational experiments on synthetic data reveal a severe scalability gap for state-of-the-art solvers, motivating a simple, effective matheuristic that delivers fast, competitive feasible solutions with high predictive performance. The approach promises interpretable, threshold-tuned risk scores suitable for high-stakes domains, and it outlines clear directions for scaling to larger datasets via MILO reformulations and enhanced modeling techniques.

Abstract

In this paper, we propose a novel Mixed-Integer Non-Linear Optimization formulation to construct a risk score, where we optimize the logistic loss with sparsity constraints. Previous approaches are typically designed to handle binary datasets, where continuous predictor variables are discretized in a preprocessing step using arbitrary thresholds, such as quantiles. In contrast, we allow the model to decide, for each continuous predictor variable, the particular threshold that is critical for prediction. The usefulness of the resulting optimization problem is tested on synthetic datasets.
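To make the objects in the abstract concrete, the following is a minimal sketch of how a threshold-based risk score could be evaluated once thresholds and integer points are fixed. It is not the paper's formulation, which is not reproduced on this page. The function names, the `>=` direction of the threshold test, the labels in {-1, +1}, and the cardinality form of the sparsity constraint are all assumptions made for illustration.

```python
import math

def risk_score(x, thresholds, points):
    """Sum the points of every predictor whose value reaches its threshold.

    Assumption: a predictor "fires" when x_j >= t_j; the paper may use a
    different convention.
    """
    return sum(p for xj, tj, p in zip(x, thresholds, points) if xj >= tj)

def predicted_risk(score, intercept):
    """Map a score to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-(intercept + score)))

def logistic_loss(X, y, thresholds, points, intercept):
    """Mean logistic loss over the sample, with labels y_i in {-1, +1} (assumed)."""
    total = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (intercept + risk_score(xi, thresholds, points))
        total += math.log1p(math.exp(-margin))
    return total / len(X)

def is_sparse(points, max_nonzero):
    """Hypothetical sparsity check: at most max_nonzero predictors carry points."""
    return sum(1 for p in points if p != 0) <= max_nonzero
```

In an optimization model, `thresholds`, `points`, and `intercept` would be decision variables rather than fixed inputs; this sketch only evaluates a candidate solution, which is the kind of check a heuristic could use to compare feasible risk scores.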

Paper Structure

This paper contains 15 sections, 12 equations, 1 figure, 4 tables, 1 algorithm.

Figures (1)

  • Figure 1: Risk score from liu2022 on the mammo dataset, which consists of a sample of biopsy patients. The model tries to predict the risk of malignancy of a breast lesion.