A distribution-free mixed-integer optimization approach to hierarchical modelling of clustered and longitudinal data

Madhav Sankaranarayanan; Intekhab Hossain; Tom Chen

TL;DR

This paper extends mixed-integer optimization to cluster-aware regression in linear mixed-effects models and introduces an algorithm that evaluates cluster effects for new data points, increasing the robustness and precision of the model. The approach is applied to student scoring and protein expression.

Abstract

Recent advances in mixed-integer optimization (MIO) algorithms, paired with hardware improvements, have led to significant speedups in solving MIO problems. These methods have been used for optimal subset selection, specifically for choosing $k$ features out of $p$ in linear regression given $n$ observations. In this paper, we extend this approach to cluster-aware regression, where the goal is to choose $\lambda$ out of $K$ clusters in a linear mixed-effects model (LMM) with $n_k$ observations in cluster $k$. Through comprehensive testing on a range of synthetic and real datasets, we show that our method solves these problems within minutes. Numerical experiments also show that the MIO approach outperforms both Gaussian- and Laplace-distributed LMMs in generating sparse solutions with high predictive power. Traditional LMMs typically assume that clustering effects are independent of individual features. However, we introduce an algorithm that evaluates cluster effects for new data points, thereby increasing the robustness and precision of the model. The inferential and predictive efficacy of this approach is further illustrated through applications to student scoring and protein expression.
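To make the selection problem in the abstract concrete, the sketch below picks at most $\lambda$ of $K$ clusters that are allowed a nonzero effect and fits the rest of the model by least squares. It is an illustration under stated assumptions, not the paper's formulation: cluster effects are taken to be simple intercept shifts, the search is plain enumeration over subsets (feasible only for small $K$) rather than an MIO solver, and the names `best_cluster_subset`, `lam`, and the synthetic data are hypothetical.

```python
from itertools import combinations

import numpy as np


def best_cluster_subset(X, y, cluster, lam):
    """Enumerate subsets of at most `lam` clusters that may have a nonzero
    intercept shift; return the least-squares fit with the smallest RSS."""
    labels = np.unique(cluster)
    p = X.shape[1]
    # One 0/1 indicator column per cluster (n x K).
    Z = (cluster[:, None] == labels[None, :]).astype(float)
    best_rss, best_beta, best_gamma, best_labels = np.inf, None, None, None
    for size in range(lam + 1):
        for subset in combinations(range(len(labels)), size):
            idx = np.array(subset, dtype=int)
            design = np.hstack([X, Z[:, idx]])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = float(np.sum((y - design @ coef) ** 2))
            if rss < best_rss:
                gamma = np.zeros(len(labels))
                gamma[idx] = coef[p:]  # effects of unselected clusters stay at zero
                best_rss, best_beta = rss, coef[:p]
                best_gamma, best_labels = gamma, labels[idx]
    return best_rss, best_beta, best_gamma, best_labels


# Synthetic example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
K, n_k = 6, 30
cluster = np.repeat(np.arange(K), n_k)
X = np.column_stack([np.ones(K * n_k), rng.normal(size=(K * n_k, 2))])
true_beta = np.array([1.0, 2.0, -1.0])
true_gamma = np.array([0.0, 3.0, 0.0, 0.0, -2.5, 0.0])  # sparse cluster effects
y = X @ true_beta + true_gamma[cluster] + rng.normal(scale=0.1, size=K * n_k)

rss, beta_hat, gamma_hat, selected = best_cluster_subset(X, y, cluster, lam=2)
```

Enumeration visits every subset of size at most $\lambda$, so its cost grows combinatorially in $K$; that growth is the reason a dedicated MIO solver is needed beyond toy sizes.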

Paper Structure (22 sections, 6 equations, 4 figures, 1 table)

Introduction
Problem setup
Brief context and background
Our approach
Methods
General MIO formulation
Algorithmic pipeline
Modelling
Classification and assignments
Simulation studies
Recovery of causal effects
Predictive performance
Computational cost
Data examples
Student performance
...and 7 more sections

Figures (4)

Figure 1: Flowchart of the experimental pipeline, also outlining the key components of our model
Figure 2: $\ell_2$ causal effect recovery on the log scale for $\beta$ (above) and $\gamma$ (below) under simulation scenarios where the cluster effects are truly Gaussian (left) and truly sparse (right), in a high-dimensional setup (14 clusters, 35 covariates, 50 observations per cluster)
Figure 3: Predictive MSE of the four algorithms in a high-dimensional setup (14 clusters, 35 covariates, 50 observations per cluster), where the cluster effects are truly Gaussian (left) and truly sparse (right)
Figure 4: (Left) Predicted random effects under the Laplace-distributed LMM, the Gaussian-distributed LMM, and MIO; (Right) concordance of cluster effects with raw scores across algorithms. Points are colored by raw average score (green = top 10%, purple = bottom 10%)
