SALSA: Sequential Approximate Leverage-Score Algorithm with Application in Analyzing Big Time Series Data

Ali Eshragh, Luke Yerbury, Asef Nazari, Fred Roosta, Michael W. Mahoney

TL;DR

This work introduces SALSA, a sequential approximate leverage-score algorithm grounded in RandNLA that approximates row leverage-scores to within a $(1 + O(\varepsilon))$ relative factor with high probability, enabling scalable processing of massive matrices. The authors then use SALSA to build LSARMA, a method that applies leverage-score sketching to large-scale time-series data to obtain maximum-likelihood ARMA parameter estimates with significantly improved worst-case running times. The theoretical results give recursive schemes for exact and approximate leverage-score computation with rigorous relative-error bounds, while experiments on synthetic and real data demonstrate SALSA’s practicality and substantial speedups over exact computation. Together, SALSA and LSARMA offer scalable, theoretically grounded tools for big-data linear algebra and time-series analysis, with potential impact on applications requiring fast, reliable model fitting on massive datasets.

Abstract

We develop a new efficient sequential approximate leverage score algorithm, SALSA, using methods from randomized numerical linear algebra (RandNLA) for large matrices. We demonstrate that, with high probability, SALSA's approximations are within a factor of $(1 + O(\varepsilon))$ of the true leverage scores. In addition, we show that the theoretical computational complexity and numerical accuracy of SALSA surpass existing approximations. These theoretical results are subsequently utilized to develop an efficient algorithm, named LSARMA, for fitting an appropriate ARMA model to large-scale time series data. Our proposed algorithm is, with high probability, guaranteed to find the maximum likelihood estimates of the parameters for the true underlying ARMA model. Furthermore, it has a worst-case running time that significantly improves on those of the state-of-the-art alternatives in big data regimes. Empirical results on large-scale data strongly support these theoretical results and underscore the efficacy of our new approach.
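To make the object being approximated concrete, the following is a minimal sketch, not the SALSA algorithm itself. It computes exact row leverage scores from a thin QR factorization and compares them with a generic sketch-based estimate that uses a dense Gaussian sketch. The function names, the matrix sizes, the sketch size `s1=500`, and the choice of a Gaussian sketch are all assumptions made for illustration.

```python
import numpy as np

def exact_leverage_scores(A):
    """Row leverage scores: squared row norms of Q in the thin QR of A."""
    Q, _ = np.linalg.qr(A, mode="reduced")
    return np.sum(Q**2, axis=1)

def sketched_leverage_scores(A, s1, rng):
    """Generic sketch-based estimate (illustrative, not SALSA).

    A Gaussian sketch Pi (s1 x m) compresses the rows of A; the R factor
    of Pi @ A is then used to approximately orthogonalize A itself.
    """
    m, _ = A.shape
    Pi = rng.standard_normal((s1, m)) / np.sqrt(s1)
    _, R = np.linalg.qr(Pi @ A, mode="reduced")
    # X = A R^{-1}, obtained by solving R^T X^T = A^T.
    X = np.linalg.solve(R.T, A.T).T
    return np.sum(X**2, axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 5))
A[:10] *= 20.0  # a few high-leverage (outlier) rows

exact = exact_leverage_scores(A)
approx = sketched_leverage_scores(A, s1=500, rng=rng)
max_rel_err = np.max(np.abs(approx - exact) / exact)
print(f"sum of exact scores: {exact.sum():.6f}")
print(f"max relative error of sketch estimate: {max_rel_err:.3f}")
```

For a full-column-rank matrix the exact scores sum to the number of columns, which gives a quick sanity check on any implementation.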

Paper Structure

This paper contains 22 sections, 12 theorems, 91 equations, 6 figures, 2 algorithms.

Key Result

theorem 1

Consider the matrix ${\bm{A}}_{m,d-1}$ as in (eqn:A_md_in_col_form) and (eqn:define_A_recursively), with $m>d$, and its corresponding leverage-scores $\ell_{m,d-1}(i)$ for $i=1,\ldots,m$. Then the leverage-scores of the augmented matrix ${\bm{A}}_{m,d}$ can be computed as [...] for $d \geq 1$ and $i=1,\ldots,m$, where ${\bm r}_{m,d}$ is defined in (eqn:exact_residual_defn) and $\ell_{
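The statement of Theorem 1 is truncated in this summary, so the example below does not reproduce the paper's exact equation or index convention. It illustrates the standard column-append identity that recursive schemes of this kind rest on: if a column $a$ is appended to $A$ and $r$ is the least-squares residual of $a$ regressed on the columns of $A$, then each leverage score grows by $r(i)^2 / \|r\|^2$. The helper names are hypothetical.

```python
import numpy as np

def leverage_scores(A):
    """Exact row leverage scores via thin QR."""
    Q, _ = np.linalg.qr(A, mode="reduced")
    return np.sum(Q**2, axis=1)

def append_column_update(A_prev, lev_prev, a_new):
    """Update leverage scores after appending column a_new to A_prev.

    r is the residual of regressing a_new on the columns of A_prev;
    the hat matrix of [A_prev, a_new] equals that of A_prev plus
    r r^T / ||r||^2, so each diagonal entry gains r(i)^2 / ||r||^2.
    """
    coef, *_ = np.linalg.lstsq(A_prev, a_new, rcond=None)
    r = a_new - A_prev @ coef
    return lev_prev + r**2 / np.dot(r, r)

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 4))

# Start from the first column and add the remaining columns one at a time.
lev = leverage_scores(A[:, :1])
for d in range(1, A.shape[1]):
    lev = append_column_update(A[:, :d], lev, A[:, d])

direct = leverage_scores(A)
print("recursion matches direct computation:", np.allclose(lev, direct))
```

The update assumes the new column is not in the span of the existing ones, so that the residual is nonzero.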

Figures (6)

  • Figure 1: Non-uniform distribution of leverage-scores of the synthetic matrix with $m=20,000,000$ rows and $n=300$ columns, deliberately altered to include $2,000$ outliers.
  • Figure 2: (a) shows the run times as a function of $s_1$ for different values of $s_2$ for computing both exact and estimated leverage-scores, while (b) displays the run times as a function of $s_2$ for different fixed values of $s_1$. These results are obtained for a synthetic matrix with $m=20,000,000$ rows and $n=300$ columns, deliberately altered to include $2,000$ outliers. Furthermore, (c) presents the $\mathtt{MAPE}$ between the exact and estimated leverage-scores against $s_1$, and (d) depicts the same error metric against $s_2$.
  • Figure 3:
  • Figure 5: (a) shows the run times of approximating leverage-scores of the real-data matrix as a function of $s_1$ for different values of $s_2$, while (b) displays the run times as a function of $s_2$ for different fixed values of $s_1$. Run time for using the exact method in computing the leverage-scores is depicted in blue. The real-data matrix contains $m=20,000,000$ rows and $n=160$ columns, extracted from the edge-weight data matrix in sybrandt2017moliere. Furthermore, (c) presents the $\mathtt{MAPE}$ between the exact and estimated leverage-scores against $s_1$, and (d) depicts the same error metric against $s_2$.
  • Figure 6: (a) Run times in computing exact and estimated parameters and (b) percentage error in the difference between exact and estimated parameters for synthetic $\mathtt{MA}(q)$ data with $n=5,000,000$ entries.
  • ...and 1 more figures
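The figure captions report a $\mathtt{MAPE}$ between exact and estimated leverage-scores without defining it here. The snippet below is a minimal sketch assuming the usual definition, mean absolute percentage error, $\frac{100}{m}\sum_{i=1}^{m} |\ell(i) - \tilde{\ell}(i)| / \ell(i)$. The paper's exact normalization may differ, and the function name is hypothetical.

```python
import numpy as np

def mape(exact, estimated):
    """Mean absolute percentage error, in percent (assumed definition)."""
    exact = np.asarray(exact, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    # Relative error per entry, averaged over all entries.
    return 100.0 * np.mean(np.abs(exact - estimated) / np.abs(exact))

# Two scores, each off by 10% of its true value.
example = mape([1.0, 2.0], [1.1, 1.8])
print(f"MAPE: {example:.2f}%")
```

Because each term is divided by the true score, rows with very small leverage can dominate this metric, which is worth keeping in mind when reading error plots for matrices with highly non-uniform scores.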

Theorems & Definitions (23)

  • theorem 1: Recursive Scheme for Exact Leverage-Scores
  • theorem 2: drineasFasterLeastSquares2011
  • theorem 3: drineasFastMonteCarlo2006
  • definition 1: Sequential Approximate Leverage-Scores
  • theorem 4: Relative Errors for Sequential Approximate Leverage-Scores
  • remark 1: $\mathtt{LSARMA}$ Computational Complexity, eshraghLSAREfficientLeverage2019, Theorem 6
  • lemma 1: Block Matrix Inversion Lemma golubMatrixComputations2013
  • proof
  • lemma 2
  • proof
  • ...and 13 more