Adaptive Matrix Sparsification and Applications to Empirical Risk Minimization
Yang P. Liu, Richard Peng, Colin Tang, Albert Weng, Junzhao Yang
TL;DR
This work develops a nearly-linear time algorithm for empirical risk minimization (ERM) with tall, dense constraint matrices by combining a robust interior point method (IPM) with an adaptive, dynamic spectral sparsifier. The key innovation is a data structure that maintains leverage score overestimates under adaptive row updates, which makes it possible to efficiently sample a spectral sparsifier approximating the Hessian along the central path. The main result shows that, for constant-sized blocks K_i with self-concordant barriers and a well-conditioned A, ERM can be solved to high accuracy in Õ(nd + d^6√n) time, which is nearly linear in the input size when A is dense and n ≥ d^10. The approach blends decremental sparsification, heavy-hitter trackers, and stability analyses of the central path to achieve provable efficiency gains for ERM in the tall, dense regime.
Abstract
Consider the empirical risk minimization (ERM) problem, which is stated as follows. Let $K_1, \dots, K_m$ be compact convex sets with $K_i \subseteq \mathbb{R}^{n_i}$ for $i \in [m]$, $n = \sum_{i=1}^m n_i$, and $n_i\le C_K$ for some absolute constant $C_K$. Also, consider a matrix $A \in \mathbb{R}^{n \times d}$ and vectors $b \in \mathbb{R}^d$ and $c \in \mathbb{R}^n$. Then the ERM problem asks to find \[ \min_{\substack{x \in K_1 \times \dots \times K_m\\ A^\top x = b}} c^\top x. \] We give an algorithm to solve this to high accuracy in time $\widetilde{O}(nd + d^6\sqrt{n}) \le \widetilde{O} (nd + d^{11})$, which is nearly-linear time in the input size when $A$ is dense and $n \ge d^{10}$. Our result is achieved by implementing an $\widetilde{O}(\sqrt{n})$-iteration interior point method (IPM) efficiently using dynamic data structures. In this direction, our key technical advance is a new algorithm for maintaining leverage score overestimates of matrices undergoing row updates. Formally, given a matrix $A \in \mathbb{R}^{n \times d}$ undergoing $T$ batches of row updates of total size $n$ we give an algorithm which can maintain leverage score overestimates of the rows of $A$ summing to $\widetilde{O}(d)$ in total time $\widetilde{O}(nd + Td^6)$. This data structure is used to sample a spectral sparsifier within a robust IPM framework to establish the main result.
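To make the role of leverage score overestimates concrete, the following is a minimal static sketch in Python (NumPy). It is not the dynamic data structure described above: it computes exact leverage scores once via a QR factorization, inflates them by a factor of 2 as a stand-in for overestimates, and samples rows independently with probability proportional to the overestimate. The function names (`leverage_scores`, `sample_sparsifier`, `spectral_error`), the inflation factor, and the oversampling constant are all illustrative assumptions and do not come from the paper.

```python
import numpy as np


def leverage_scores(A):
    """Exact leverage scores tau_i = a_i^T (A^T A)^{-1} a_i.

    Assumes A has full column rank. With A = QR (thin QR), tau_i is the
    squared norm of the i-th row of Q, and the scores sum to d.
    """
    Q, _ = np.linalg.qr(A)
    return np.sum(Q * Q, axis=1)


def sample_sparsifier(A, overestimates, oversample, rng):
    """Keep row i with probability p_i = min(1, oversample * overestimate_i).

    Kept rows are rescaled by 1/sqrt(p_i) so that E[B^T B] = A^T A.
    """
    p = np.minimum(1.0, oversample * overestimates)
    keep = rng.random(A.shape[0]) < p
    return A[keep] / np.sqrt(p[keep])[:, None]


def spectral_error(A, B):
    """Return || R^{-T} B^T B R^{-1} - I ||_2, where A = QR.

    A value eps < 1 means (1 - eps) A^T A <= B^T B <= (1 + eps) A^T A.
    """
    _, R = np.linalg.qr(A)
    W = np.linalg.solve(R.T, B.T)  # W = R^{-T} B^T, shape (d, rows of B)
    return np.linalg.norm(W @ W.T - np.eye(A.shape[1]), 2)


rng = np.random.default_rng(0)
n, d = 5000, 4
A = rng.standard_normal((n, d))    # a tall, dense matrix

tau = leverage_scores(A)
overestimates = 2.0 * tau          # hypothetical overestimates, summing to 2d

B = sample_sparsifier(A, overestimates, oversample=100.0, rng=rng)
err = spectral_error(A, B)

print("sum of leverage scores:", tau.sum())
print("rows kept:", B.shape[0], "of", n)
print("spectral error:", err)
```

Because the overestimates sum to O(d), the expected number of sampled rows depends on d and the oversampling factor rather than on n, which is why maintaining overestimates with a small sum matters for the running time of each IPM iteration.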
