Sparse Max-Affine Regression

Haitham Kanj, Seonho Kim, Kiryung Lee

Abstract

This paper presents Sparse Gradient Descent (Sp-GD) as a solution for variable selection in convex piecewise linear regression, where the model is given as the maximum of $k$ affine functions $x \mapsto \max_{j \in [k]} \langle a_j^\star, x \rangle + b_j^\star$. Here, $\{ a_j^\star\}_{j=1}^k$ and $\{b_j^\star\}_{j=1}^k$ denote the ground-truth weight vectors and intercepts. A non-asymptotic local convergence analysis is provided for Sp-GD under sub-Gaussian noise when the covariate distribution satisfies sub-Gaussianity and anti-concentration properties. When the model order and parameters are fixed, Sp-GD provides an $\epsilon$-accurate estimate given $\mathcal{O}(\max(\epsilon^{-2}\sigma_z^2,1)s\log(d/s))$ observations, where $\sigma_z^2$ denotes the noise variance. This also implies exact parameter recovery by Sp-GD from $\mathcal{O}(s\log(d/s))$ noise-free observations. The proposed initialization scheme uses sparse principal component analysis to estimate the subspace spanned by $\{ a_j^\star\}_{j=1}^k$, then applies an $r$-covering search to estimate the model parameters. A non-asymptotic analysis is presented for this initialization scheme when the covariates and noise samples follow Gaussian distributions. When the model order and parameters are fixed, this initialization scheme provides an $\epsilon$-accurate estimate given $\mathcal{O}(\epsilon^{-2}\max(\sigma_z^4,\sigma_z^2,1)s^2\log^4(d))$ observations. A new transformation named Real Maslov Dequantization (RMD) is proposed to transform sparse generalized polynomials into sparse max-affine models. The error decay rate of RMD is shown to be exponentially small in its temperature parameter. Furthermore, theoretical guarantees for Sp-GD are extended to the bounded noise model induced by RMD. Numerical Monte Carlo results corroborate theoretical findings for Sp-GD and the initialization scheme.
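To make the max-affine model in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It evaluates $x \mapsto \max_{j \in [k]} \langle a_j, x \rangle + b_j$ and performs one gradient step on the least-squares loss, followed by a projection that keeps the $s$ coordinates with the largest energy across all $k$ weight vectors. The abstract does not specify the loss, the projection rule, or the step size, so all three are assumptions made for illustration, and the function names (`max_affine`, `keep_top_s_coordinates`, `sparse_gd_step`) are hypothetical.

```python
import numpy as np

def max_affine(X, A, b):
    """Evaluate max_j <a_j, x> + b_j for each row x of X.
    X: (n, d) covariates, A: (k, d) weight vectors, b: (k,) intercepts."""
    return np.max(X @ A.T + b, axis=1)

def keep_top_s_coordinates(A, s):
    """Zero every column of A except the s columns with the largest energy
    summed over the k pieces (joint sparsity; an assumed projection rule)."""
    energy = np.sum(A ** 2, axis=0)
    keep = np.argsort(energy)[-s:]
    out = np.zeros_like(A)
    out[:, keep] = A[:, keep]
    return out

def sparse_gd_step(X, y, A, b, s, step):
    """One gradient step on (1/2n) * sum_i (max_affine(x_i) - y_i)^2,
    followed by the assumed top-s projection of the weight vectors."""
    n = X.shape[0]
    scores = X @ A.T + b                      # (n, k) value of each affine piece
    active = np.argmax(scores, axis=1)        # index of the maximizing piece
    resid = scores[np.arange(n), active] - y  # prediction error per sample
    grad_A = np.zeros_like(A)
    grad_b = np.zeros_like(b)
    for j in range(A.shape[0]):
        mask = active == j                    # samples where piece j is active
        grad_A[j] = resid[mask] @ X[mask] / n
        grad_b[j] = resid[mask].sum() / n
    A_new = keep_top_s_coordinates(A - step * grad_A, s)
    b_new = b - step * grad_b
    return A_new, b_new
```

In this sketch, each sample contributes only to the gradient of the affine piece that attains the maximum, which is why the analysis in the abstract is local: the partition of samples among pieces depends on the current estimate.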

Paper Structure

This paper contains 32 sections, 25 theorems, 192 equations, 6 figures, 2 tables, 3 algorithms.

Key Result

Theorem 1.1

Let the covariates and noise be sampled independently from sub-Gaussian distributions. For fixed $k$ and ground-truth $\hm{\theta}^\star$ satisfying eq:ind_sparsity, with high probability, a suitably initialized Sp-GD converges linearly to an $\epsilon$-accurate estimate of $\hm{\theta}^\star$ given

Figures (6)

  • Figure 1: Median of $\mathtt{err}(\widehat{\hm{\theta}})$ for different ($n$,$d$) pairs using 50 Monte Carlo iterations for $k =3$ and $s=25$ with Gaussian (top) and Uniform (bottom) covariate distributions in the noiseless case.
  • Figure 2: Median of $\mathtt{err}(\widehat{\hm{\theta}})$ for different ($n$,$s$) pairs using 50 Monte Carlo iterations for $k=3$ and $d=200$ with Gaussian (left) and Uniform (right) covariate distributions. The red curves are fitted with respect to $s\log (d/s)$ at the phase transition boundary for both figures.
  • Figure 3: Median of $\mathtt{err}(\widehat{\hm{\theta}})$ for different ($n$,$\sigma_z^2$) pairs using 50 Monte Carlo iterations for $s=50$, $d=200$ and $k=3$ with Gaussian covariates and local initial estimate.
  • Figure 4: Projection error difference using PCA (red) and Algorithm \ref{algo:sPCA} (blue), with dashed guidelines showing $1/\sqrt{n}$ decay (black) and $1/n$ decay (green), for $s=20$, $d=200$, $k=3$, $\sigma_z=0.1$ and 50 Monte Carlo iterations.
  • Figure 5: Parameter estimation error using PCA (red) and Algorithm \ref{algo:sPCA} (blue) when followed by $M$ random samples with $s=20$, $d=200$, $k=3$, $\sigma_z=0.1$ averaged over 50 Monte Carlo iterations.
  • ...and 1 more figure

Theorems & Definitions (39)

  • Theorem 1.1: Informal
  • Theorem 1.2: Informal
  • Theorem 1.3: Informal
  • Theorem 2.3
  • Remark 2.4
  • Theorem 3.1
  • Proof: Proof of Theorem \ref{Theo:Projection}
  • Lemma 3.2
  • Theorem 3.3: A paraphrase of ghosh2021max
  • Remark 3.4
  • ...and 29 more