Modeling with Categorical Features via Exact Fusion and Sparsity Regularisation

Kayhan Behdin, Riade Benbaki, Peter Radchenko, Rahul Mazumder

Abstract

We study the high-dimensional linear regression problem with categorical predictors that have many levels. We propose a new estimation approach, which performs model compression via two mechanisms by simultaneously encouraging (a) clustering of the regression coefficients to collapse some of the categorical levels together; and (b) sparsity of the regression coefficients. We present novel mixed integer programming formulations for our estimator, and develop a custom row generation procedure to speed up the exact off-the-shelf solvers. We also propose a fast approximate algorithm for our method that obtains high-quality feasible solutions via block coordinate descent. As the main building block of our algorithm, we develop an exact algorithm for the univariate case based on dynamic programming, which can be of independent interest. We establish new theoretical guarantees for both the prediction and the cluster recovery performance of our estimator. Our numerical experiments on synthetic and real datasets demonstrate that our proposed estimator tends to outperform the state-of-the-art.
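The abstract mentions that the main building block of the proposed algorithm is an exact dynamic-programming solver for the univariate case. The paper's own formulation is not reproduced here, but the flavor of such a step can be illustrated with a generic DP that exactly partitions sorted one-dimensional coefficient values into at most k contiguous clusters minimizing within-cluster squared error; the function name, interface, and objective are illustrative assumptions, not the authors' algorithm.

```python
def univariate_cluster_dp(values, k):
    """Exact 1-D clustering via dynamic programming (illustrative sketch).

    Partitions the sorted values into at most k contiguous clusters,
    minimizing the total within-cluster squared error. This mirrors the
    general idea of an exact univariate DP subroutine; it is NOT the
    estimator from the paper.
    """
    v = sorted(values)
    n = len(v)
    # Prefix sums of v and v^2 allow O(1) within-segment cost queries.
    s1, s2 = [0.0], [0.0]
    for x in v:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def seg_cost(i, j):
        # Squared error of v[i:j] around its mean.
        m = s1[j] - s1[i]
        return (s2[j] - s2[i]) - m * m / (j - i)

    INF = float("inf")
    # dp[c][j]: best cost of splitting the first j values into c clusters.
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(c - 1, j):
                cand = dp[c - 1][i] + seg_cost(i, j)
                if cand < dp[c][j]:
                    dp[c][j] = cand
    return min(dp[c][n] for c in range(1, k + 1))
```

For example, `univariate_cluster_dp([0.0, 0.1, 5.0, 5.1], 2)` collapses the four values into two clusters, {0.0, 0.1} and {5.0, 5.1}, with total cost 0.01. The O(kn^2) runtime comes from the triple loop; prefix sums keep each segment-cost evaluation constant-time.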

Paper Structure

This paper contains 51 sections, 19 theorems, 145 equations, 8 figures, 10 tables, and 3 algorithms.

Key Result

Theorem 1

Let $(\hat{\alpha},\hat{\boldsymbol{\beta}})$ be a global optimal solution to Problem \eqref{clusteringproblem-reg} with $\lambda_0\geq c_{\lambda_0}\sigma^2 \log (ep)/n$ for some sufficiently large constant $c_{\lambda_0}>0$. Then, with high probability, the stated prediction error bound holds; an explicit expression for the probability can be found in the full statement of the theorem.

Figures (8)

  • Figure 1: The regression coefficients for the illustrative example in Section \ref{sec:intro-approach}. We consider three values of the parameter $\lambda$, which controls the strength of the fusion penalty, and two values of $\lambda_0$, which controls the strength of the sparsity penalty. The first 23 coefficients correspond to the hour of the day (sorted based on their coefficient values when $\lambda_0=50$), while the rest represent the weekday (sorted similarly). The estimators in the two left-most panels (nearly) attain the best out-of-sample performance across $\lambda$ values with a test set $R^2$ of 0.26. See Appendix \ref{supp:illustrative} for the interpretation of the regression coefficients.
  • Figure 2: Experiments from Section \ref{results-set1} with $r_1=4,r_2=12, q=20, q_s=3$ and $\rho=0.2,\sigma=2$. We do not plot the purity and the number of clusters for Elastic Net as it produces a large number of clusters. The vertical bars at each point indicate the corresponding standard errors.
  • Figure 3: Experiments from Section \ref{results-set1} with $r_1=r_2=10,q=20,q_s=5, \rho=0.2$ and $\sigma=2$ with varying $n$. We do not plot the purity and the number of clusters for Elastic Net as it produces a large number of clusters. The vertical bars at each point indicate the corresponding standard errors.
  • Figure H.1: Experiments from Appendix \ref{supp:synthetic} with $r_1=4,r_2=12, q=20, q_s=3$ and $\rho=0.2,\sigma=1.5$ with varying $n$. We do not plot the purity and the number of clusters for Elastic Net as it produces a large number of clusters. The vertical bars at each point indicate the corresponding standard errors.
  • Figure H.2: Experiments from Appendix \ref{supp:synthetic} with $r_1=4,r_2=12, q=20, q_s=3$ and $\rho=0.2,\sigma=2.5$. We do not plot the purity and the number of clusters for Elastic Net as it produces a large number of clusters. The vertical bars at each point indicate the corresponding standard errors.
  • ...and 3 more figures

Theorems & Definitions (45)

  • Theorem 1
  • Theorem 2
  • Corollary 1
  • Remark 1
  • Example 1
  • Definition 1: Approximation Error
  • Theorem 3
  • Remark 2
  • Remark 3
  • Remark 4
  • ...and 35 more