Near-optimal Linear Predictive Clustering in Non-separable Spaces via Mixed Integer Programming and Quadratic Pseudo-Boolean Reductions

Jiazhou Liang, Hassan Khurram, Scott Sanner

TL;DR

This work tackles LPC in non-separable spaces by extending a globally optimal MIP framework with two near-optimal approaches: LPC-NS-MIP and LPC-NS-QPBO. By leveraging non-separability, it derives a regression-coefficient abstraction that reduces optimization variables and yields a QPBO-based clustering formulation, accompanied by provable error bounds. Empirical results on synthetic and real data show closer alignment to the global optimum and lower regression errors than greedy methods, with substantial scalability improvements over prior MIP-based LPC formulations. The methods offer practical, scalable solutions for partitioning data into clusters with distinct linear relationships in challenging non-separable settings, broadening the applicability of LPC in marketing, medicine, and education.

Abstract

Linear Predictive Clustering (LPC) partitions samples based on shared linear relationships between feature and target variables, with numerous applications including marketing, medicine, and education. Greedy optimization methods, commonly used for LPC, alternate between clustering and linear regression but lack global optimality. While effective for separable clusters, they struggle in non-separable settings where clusters overlap in feature space. In an alternative constrained optimization paradigm, Bertsimas and Shioda (2007) formulated LPC as a Mixed-Integer Program (MIP), ensuring global optimality regardless of separability but suffering from poor scalability. This work builds on the constrained optimization paradigm to introduce two novel approaches that improve the efficiency of global optimization for LPC. By leveraging key theoretical properties of separability, we derive near-optimal approximations with provable error bounds, significantly reducing the MIP formulation's complexity and improving scalability. Additionally, we can further approximate LPC as a Quadratic Pseudo-Boolean Optimization (QPBO) problem, achieving substantial computational improvements in some settings. Comparative analyses on synthetic and real-world datasets demonstrate that our methods consistently achieve near-optimal solutions with substantially lower regression errors than greedy optimization while exhibiting superior scalability over existing MIP formulations.
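
The greedy baseline described above, which alternates between a regression step and a cluster-assignment step, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation or any method evaluated in the paper: the function name `greedy_lpc`, its parameters, the random initialization, and the squared-error reassignment rule are all hypothetical choices made for this example.

```python
import numpy as np


def greedy_lpc(X, y, n_clusters=2, n_iters=50, seed=0):
    """Illustrative greedy LPC: alternate least squares and reassignment.

    Hypothetical sketch; returns (labels, coefs, sse) where coefs[k] holds
    the feature weights followed by the intercept for cluster k.
    """
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # Append a column of ones so each cluster's model has an intercept.
    Xb = np.hstack([X, np.ones((n_samples, 1))])
    n_params = Xb.shape[1]
    # Assumption: start from a uniformly random cluster assignment.
    labels = rng.integers(n_clusters, size=n_samples)
    coefs = np.zeros((n_clusters, n_params))

    for _ in range(n_iters):
        # Regression step: fit ordinary least squares within each cluster.
        for k in range(n_clusters):
            mask = labels == k
            # Skip clusters with too few samples; keep their previous fit.
            if mask.sum() >= n_params:
                coefs[k] = np.linalg.lstsq(Xb[mask], y[mask], rcond=None)[0]
        # Assignment step: move each sample to the model that fits it best.
        sq_residuals = (y[:, None] - Xb @ coefs.T) ** 2
        new_labels = sq_residuals.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no sample changed cluster: a local optimum was reached
        labels = new_labels

    # Total squared error of the final assignment under the final models.
    predictions = np.einsum("ij,ij->i", Xb, coefs[labels])
    sse = float(((y - predictions) ** 2).sum())
    return labels, coefs, sse


if __name__ == "__main__":
    # Two clusters that overlap completely in feature space but follow
    # different linear relationships with the target (non-separable case).
    rng = np.random.default_rng(1)
    X = rng.uniform(-1.0, 1.0, size=(200, 1))
    truth = rng.integers(2, size=200)
    y = np.where(truth == 0, 2.0 * X[:, 0] + 1.0, -2.0 * X[:, 0] - 1.0)
    labels, coefs, sse = greedy_lpc(X, y, n_clusters=2)
    print("sum of squared errors:", sse)
```

Because each step can only lower the objective or leave it unchanged, the loop stops at a local optimum that depends on the initial assignment. This is the lack of global optimality that the MIP-based formulations in the paper are designed to address.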

Paper Structure

This paper contains 47 sections, 49 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: The plot illustrates a case where the feature variables (horizontal axes) from samples belonging to two different ground truth clusters (circles and squares) overlap, making them non-separable in the feature space. However, each cluster follows a distinct linear relationship (hyperplane) with the target variable (vertical axis). This non-separability renders CLR (cf. right)—which clusters solely based on feature variables—ineffective: it assigns clusters (shown in blue and red) that mix the ground truth labels. In contrast, LPC (cf. left) recovers the ground truth linear predictor assignment.
  • Figure 2: RQ1: Trade-off between the difference from the GlobalOpt objective (y-axis) and optimization time in seconds (x-axis) across methods as sample size increases. The sample size is limited to 200 for $K = 2$ and 90 for $K = 3$ so that GlobalOpt can reach optimality in $\leq 2$ hours (cf. Fig. 3). Each data point is the mean over trials, and the solid interval indicates the 95% confidence interval.
  • Figure 3: Runtime as the number of samples increases for $K = 2$ (left) and as the number of clusters increases (right). LPC-NS-QPBO scales better, handling $N = 2000$ samples within 2 hours, compared to $N = 200$ for GlobalOpt.
  • Figure 4: RQ2, Noise in Target Variables: Performance difference across methods, with 95% confidence intervals, as the level of Gaussian noise in the target increases.
  • Figure 5: RQ2, Dimensionality of Feature Variables: Mean performance difference across methods over experimental trials, with 95% confidence intervals, as the number of feature variables increases.
  • ...and 5 more figures