
Solving Sparse \& High-Dimensional-Output Regression via Compression

Renyuan Li, Zhehui Chen, Guanyi Wang

TL;DR

This paper proposes a Sparse \& High-dimensional-Output REgression (SHORE) model, which adds sparsity requirements to improve output interpretability, and designs a computationally efficient two-stage optimization framework that solves SHORE with provable accuracy by compressing the outputs.

Abstract

Multi-Output Regression (MOR) has been widely used in scientific data analysis for decision-making. Unlike traditional regression models, MOR aims to simultaneously predict multiple real-valued outputs given an input. However, the increasing dimensionality of the outputs poses significant challenges regarding interpretability and computational scalability for modern MOR applications. As a first step to address these challenges, this paper proposes a Sparse \& High-dimensional-Output REgression (SHORE) model by incorporating additional sparsity requirements to resolve the output interpretability, and then designs a computationally efficient two-stage optimization framework capable of solving SHORE with provable accuracy via compression on outputs. Theoretically, we show that the proposed framework is computationally scalable while maintaining the same order of training loss and prediction loss before-and-after compression under arbitrary or relatively weak sample set conditions. Empirically, numerical results further validate the theoretical findings, showcasing the efficiency and accuracy of the proposed framework.

Paper Structure

This paper contains 36 sections, 9 theorems, 103 equations, 4 figures, 2 tables, 2 algorithms.

Key Result

Theorem 1

For any $\delta \in (0, 1)$ and $\tau \in (0,1)$, suppose the compressed matrix $\bm{\Phi}$ satisfies Assumption assump:RIP-Phi with $m \geq O(\frac{1}{\delta^{2}}\cdot \log(\frac{K}{\tau}))$. Then the following inequality for the training loss holds with probability at least $1 - \tau$, where $\widehat{\bm{Z}}, \widehat{\bm{W}}$ are optimal solutions for the uncompressed eq:MLC and compressed SHORE eq:

Figures (4)

  • Figure 1: Numerical results on synthetic data. Each dot in the figure represents the average value over 10 independent trials (i.e., experiments) with compressed matrices $\bm{\Phi}^{(1)}, \ldots, \bm{\Phi}^{(10)}$ for a given tuple of parameters $(K,d,n,\text{SNR},m)$. The shaded parts represent the empirical standard deviations over the 10 trials. In the first row, we plot the ratio of training loss after and before compression, i.e., $\|\bm{\Phi Y} - \widehat{\bm{W}}\bm{X}\|_F^2/\|\bm{Y} - \widehat{\bm{Z}}\bm{X}\|_F^2$, versus the number of rows $m$. The ratio approaches one as $m$ increases, which supports the result presented in Theorem \ref{thm:training-loss-bound}. In the second row, we plot precision@3 versus the number of rows. As we can observe, the proposed algorithm outperforms CD and FISTA.
  • Figure 2: The figure reports the prediction running time (measured in seconds) on synthetic data for the proposed algorithm with early stopping under different compressed output dimensions. As we can observe, the running time first decreases dramatically, then increases almost linearly with respect to $m$. This happens because the maximum number of iterations in the implemented prediction method with early stopping is 60, which is relatively large. As $m$ increases but is still less than 500, the actual number of iterations drops dramatically due to the early stopping criterion; after $m$ passes 500, the actual number of iterations stays around 10, and the running time then grows linearly as the dimension increases.
  • Figure 3: This figure reports the numerical results on real data -- EURLex-4K. Each dot in the figure represents 10 independent trials (i.e., experiments) of compressed matrices $\bm{\Phi}^{(1)}, \ldots, \bm{\Phi}^{(10)}$ on a given tuple of parameters $(s,m)$. The curves in each panel correspond to the averaged values for the proposed algorithm and baselines over 10 trials; the shaded parts represent the empirical standard deviations over 10 trials. In the first row, we plot the output distance versus the number of rows. In the second row, we plot the precision versus the number of rows; no significant differences between these prediction methods can be observed.
  • Figure 4: This figure reports the numerical results on real data -- Wiki10-31K. Similar to the plot reporting on EURLex-4K above, each dot in the figure represents 10 independent trials (i.e., experiments) of compressed matrices $\bm{\Phi}^{(1)}, \ldots, \bm{\Phi}^{(10)}$ on a given tuple of parameters $(s,m)$. The curves in each panel correspond to the averaged values for the proposed algorithm and baselines over 10 trials; the shaded parts represent the empirical standard deviations over 10 trials. In the first row, the proposed algorithm achieves higher precision than FISTA, especially when $s$ is small. In the second and third rows, which show output difference and prediction loss, the proposed algorithm improves only slightly on CD in output difference.
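
The training-loss ratio plotted in the first row of Figure 1, $\|\bm{\Phi Y} - \widehat{\bm{W}}\bm{X}\|_F^2/\|\bm{Y} - \widehat{\bm{Z}}\bm{X}\|_F^2$, can be illustrated with a small numerical sketch. The version below is not the paper's experimental code and rests on several assumptions that do not come from the source: $\bm{\Phi}$ is drawn as a Gaussian matrix scaled by $1/\sqrt{m}$ (one common construction, used here as a stand-in for Assumption assump:RIP-Phi), both regressions are solved by plain unconstrained least squares rather than the sparsity-constrained SHORE problem, and the function name `training_loss_ratio`, the data-generating process, and all problem sizes are hypothetical.

```python
import numpy as np


def training_loss_ratio(K=100, d=10, n=200, m=50, noise=0.5, seed=0):
    """Ratio ||Phi Y - W_hat X||_F^2 / ||Y - Z_hat X||_F^2 on toy data.

    Assumptions (not from the paper): Gaussian Phi scaled by 1/sqrt(m),
    dense Gaussian ground truth, unconstrained least-squares fits.
    """
    rng = np.random.default_rng(seed)

    # Toy data: X holds one input per column, Y one K-dimensional output per column.
    X = rng.standard_normal((d, n))
    Z_true = rng.standard_normal((K, d))
    Y = Z_true @ X + noise * rng.standard_normal((K, n))

    # Compression matrix with m rows acting on the K-dimensional outputs.
    Phi = rng.standard_normal((m, K)) / np.sqrt(m)

    # Uncompressed fit: minimize ||Y - Z X||_F^2 over Z (K x d).
    Z_hat = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T

    # Compressed fit: minimize ||Phi Y - W X||_F^2 over W (m x d).
    W_hat = np.linalg.lstsq(X.T, (Phi @ Y).T, rcond=None)[0].T

    loss_compressed = np.linalg.norm(Phi @ Y - W_hat @ X, "fro") ** 2
    loss_uncompressed = np.linalg.norm(Y - Z_hat @ X, "fro") ** 2
    return loss_compressed / loss_uncompressed


if __name__ == "__main__":
    # Print the ratio for a few compressed dimensions m.
    for m in (10, 50, 200, 800):
        print(m, training_loss_ratio(m=m))
```

Under these assumptions the compressed residual equals $\bm{\Phi}$ applied to the uncompressed residual, so the ratio measures how well $\bm{\Phi}$ preserves the Frobenius norm of that residual; it should concentrate around one as $m$ grows, which mirrors the qualitative behavior described in the caption of Figure 1.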

Theorems & Definitions (26)

  • Remark 1
  • Remark 2
  • Definition 1
  • Remark 3
  • Theorem 1
  • Theorem 2
  • Remark 4
  • Theorem 3
  • Remark 5
  • Theorem 4
  • ...and 16 more