Matrix Product Sketching via Coordinated Sampling

Majid Daliri, Juliana Freire, Danrong Li, Christopher Musco

TL;DR

This work addresses the problem of approximating the matrix product $\boldsymbol{A}^T\boldsymbol{B}$ from small sketches that are computed independently, using only a shared random seed. It introduces coordinated sampling via priority sampling to estimate $\boldsymbol{A}^T\boldsymbol{B}$ from row-subsets of $\boldsymbol{A}$ and $\boldsymbol{B}$, providing an unbiased estimator with a Frobenius error bound that matches the best linear-sketch guarantees in the worst case and improves substantially for sparse matrices. The main theoretical result shows that a fixed sketch size $k = \frac{2}{\delta\varepsilon^2} + 1$ yields an estimate $\boldsymbol{W}$ with $\|\boldsymbol{W}-\boldsymbol{A}^T\boldsymbol{B}\|_F \leq \varepsilon \|\boldsymbol{A}\|_F \|\boldsymbol{B}\|_F$ with probability $1-\delta$, while each sketch is computed independently by its own party. The authors demonstrate practical benefits in distributed linear regression and attention-matrix approximation for transformers, reporting an order-of-magnitude improvement over traditional Johnson–Lindenstrauss-style linear sketches, especially when the input matrices are sparse. Overall, the work offers a scalable, independent-sketch paradigm that preserves accuracy while enabling efficient cross-machine matrix products and related regression tasks.
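
As a worked instance of the sketch-size formula quoted in the TL;DR (the values $\varepsilon = 0.1$ and $\delta = 0.1$ are illustrative choices, not taken from the paper):

$$k = \frac{2}{\delta\varepsilon^2} + 1 = \frac{2}{0.1 \cdot (0.1)^2} + 1 = \frac{2}{0.001} + 1 = 2001.$$

Halving $\varepsilon$ to $0.05$ under the same formula gives $k = \frac{2}{0.1 \cdot 0.0025} + 1 = 8001$, so the sketch size grows by roughly a factor of four, as expected from the $1/\varepsilon^2$ dependence.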

Abstract

We revisit the well-studied problem of approximating a matrix product, $\mathbf{A}^T\mathbf{B}$, based on small space sketches $\mathcal{S}(\mathbf{A})$ and $\mathcal{S}(\mathbf{B})$ of $\mathbf{A} \in \mathbb{R}^{n \times d}$ and $\mathbf{B}\in \mathbb{R}^{n \times m}$. We are interested in the setting where the sketches must be computed independently of each other, except for the use of a shared random seed. We prove that, when $\mathbf{A}$ and $\mathbf{B}$ are sparse, methods based on coordinated random sampling can outperform classical linear sketching approaches, like Johnson-Lindenstrauss Projection or CountSketch. For example, to obtain Frobenius norm error $\epsilon\|\mathbf{A}\|_F\|\mathbf{B}\|_F$, coordinated sampling requires sketches of size $O(s/\epsilon^2)$ when $\mathbf{A}$ and $\mathbf{B}$ have at most $s \leq d,m$ non-zeros per row. In contrast, linear sketching leads to sketches of size $O(d/\epsilon^2)$ and $O(m/\epsilon^2)$ for $\mathbf{A}$ and $\mathbf{B}$. We empirically evaluate our approach on two applications: 1) distributed linear regression in databases, a problem motivated by tasks like dataset discovery and augmentation, and 2) approximating attention matrices in transformer-based language models. In both cases, our sampling algorithms yield an order of magnitude improvement over linear sketching.
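
The coordinated sampling idea described in the abstract can be illustrated with a short NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `priority_sketch` and `estimate_product` are hypothetical, and the choices of weighting rows by their squared Euclidean norm and of using the $k$-th smallest rank as the threshold follow the standard priority-sampling recipe rather than anything confirmed by the text above. The shared random seed is what lets the two parties sample coordinated row subsets without communicating.

```python
import numpy as np

def priority_sketch(M, k, seed):
    """Sketch M by keeping at most k-1 rows with the smallest rank h(i)/||M_i||^2.

    Assumption: row weights are squared row norms (standard priority sampling).
    """
    n = M.shape[0]
    # Shared randomness: the same seed gives both parties the same h(i) per row index.
    h = np.random.default_rng(seed).uniform(size=n)
    w = np.sum(M * M, axis=1)            # squared row norms
    nz = np.flatnonzero(w > 0)           # all-zero rows are never sampled
    ranks = h[nz] / w[nz]
    order = np.argsort(ranks)
    if len(nz) < k:
        keep, tau = nz, np.inf           # everything fits: no threshold needed
    else:
        keep = nz[order[:k - 1]]         # k-1 smallest ranks
        tau = ranks[order[k - 1]]        # k-th smallest rank acts as threshold
    return {"idx": keep, "rows": M[keep], "tau": tau}

def estimate_product(sa, sb):
    """Estimate A^T B from two sketches built with the same seed."""
    _, ia, ib = np.intersect1d(sa["idx"], sb["idx"], return_indices=True)
    Ra, Rb = sa["rows"][ia], sb["rows"][ib]
    wa = np.sum(Ra * Ra, axis=1)
    wb = np.sum(Rb * Rb, axis=1)
    # Probability that row i lands in both sketches, given the two thresholds.
    p = np.minimum(1.0, np.minimum(wa * sa["tau"], wb * sb["tau"]))
    # Inverse-probability weighting of each sampled outer product A_i^T B_i.
    return (Ra / p[:, None]).T @ Rb

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m = 2000, 20, 15
    # Synthetic sparse inputs (illustrative only): about 10% non-zero entries.
    A = rng.normal(size=(n, d)) * (rng.uniform(size=(n, d)) < 0.1)
    B = rng.normal(size=(n, m)) * (rng.uniform(size=(n, m)) < 0.1)
    W = estimate_product(priority_sketch(A, 400, seed=42),
                         priority_sketch(B, 400, seed=42))
    err = np.linalg.norm(W - A.T @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
    print(f"relative Frobenius error: {err:.4f}")
```

Each party calls `priority_sketch` on its own matrix; only the sampled rows, their indices, and the threshold are exchanged. Because the stored rows keep their original sparsity, the space per sampled row scales with the number of non-zeros rather than with $d$ or $m$, which is the intuition behind the $O(s/\epsilon^2)$ versus $O(d/\epsilon^2)$ comparison in the abstract.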

Paper Structure

This paper contains 22 sections, 5 theorems, 29 equations, 9 figures, 4 algorithms.

Key Result

Theorem 2

Consider $\mathbf{A}\in \mathbb{R}^{n \times d}$, $\mathbf{B}\in \mathbb{R}^{n \times m}$, and any $\epsilon, \delta \in (0,1)$. There is a sketching procedure (alg:priority_sampling) that constructs sketches $\mathcal{S}(\mathbf{A})$ and $\mathcal{S}(\mathbf{B})$ consisting of at most $k = \frac{2/

Figures (9)

  • Figure 1: Performance of matrix product sketching over synthetic data with varying sparsity levels (10%, 40%, and 80%). The curves for priority sampling and threshold sampling overlap, and both methods increasingly outperform the JL sketch as the level of sparsity increases.
  • Figure 2: Comparison of Regression Sketching Methods on the IMDB Dataset: The plots illustrate the approximation error of different sketching methods across various sketch sizes. The matrix $\mathbf{A}$ is generated using TF-IDF on 10,000 random reviews, keeping the top 256, 512, and 1024 features. As the dimensionality increases, the matrices become more sparse. The vector $\mathbf{b}$ represents the sentiment scores (positivity or negativity) of the reviews.
  • Figure 3: Sketched Regression Methods on the Android Review Dataset: The plots illustrate the approximation error of different sketching methods across various sketch sizes. The matrix $\mathbf{A}$ is generated using the sparse transformer SPLADE [formal2022distillation] over 10,000 random reviews, retaining the top 128, 256, and 512 most important features. The vector $\mathbf{b}$ represents the review scores.
  • Figure 4: Comparison of KV Cache Sketching Methods on the LongBench for MultiFieldQA: The plots show the accuracy of different sketching methods approximating $\mathbf{Q} \mathbf{K}^T$ across various sketch sizes. The matrices $\mathbf{Q}$ and $\mathbf{K}$ are generated from prompt tokens, and the approximation errors are displayed.
  • Figure 5: Comparison of KV Cache Sketching Methods on the LongBench for MultiFieldQA: The plots illustrate the accuracy of various sketching methods in approximating $\mathbf{Q} \mathbf{K}^T$ across different sketch sizes. The Query matrix remains untouched, and only the Key matrices $\mathbf{K}$ are sketched using Priority Sampling and Threshold Sampling, whereas the JL sketch requires the projection of both matrices $\mathbf{Q}, \mathbf{K}$.
  • ...and 4 more figures

Theorems & Definitions (10)

  • Theorem 2: Main Result
  • Theorem 3: Sketched Regression
  • Theorem 4
  • Lemma 5
  • Proof of \ref{thm:main_priority}
  • Proof of \ref{thrm:main}
  • Proof of \ref{thrm:regression}
  • Proof of \ref{lemma:innerprod_columns}
  • Theorem 6
  • Proof of \ref{thm:threshold}