Sketchy Moment Matching: Toward Fast and Provable Data Selection for Finetuning

Yijun Dong; Hoang Phan; Xiang Pan; Qi Lei

TL;DR

Sketchy Moment Matching (SkMM) addresses data selection for finetuning in high-dimensional models by harnessing a variance-bias tradeoff driven by a low intrinsic dimension. The method first uses gradient sketching to identify a compact subspace ${\mathcal{S}}$ that captures the essential finetuning directions, then performs moment matching within this subspace to control variance, achieving a fast-rate generalization ${O(\dim({\mathcal{S}})/n)}$. Theoretical results show gradient sketching provably yields a low-bias subspace and preserves fast-rate learning, while a practical quadratic-programming relaxation enables scalable moment matching in the reduced space. Empirical results on synthetic data, regression, and image-classification tasks demonstrate SkMM’s advantage in low-data regimes, with robustness to data imbalances and strong performance relative to standard baselines.

Abstract

We revisit data selection in a modern context of finetuning from a fundamental perspective. Extending the classical wisdom of variance minimization in low dimensions to high-dimensional finetuning, our generalization analysis unveils the importance of additionally reducing bias induced by low-rank approximation. Inspired by the variance-bias tradeoff in high dimensions from the theory, we introduce Sketchy Moment Matching (SkMM), a scalable data selection scheme with two stages. (i) First, the bias is controlled using gradient sketching that explores the finetuning parameter space for an informative low-dimensional subspace $\mathcal{S}$; (ii) then the variance is reduced over $\mathcal{S}$ via moment matching between the original and selected datasets. Theoretically, we show that gradient sketching is fast and provably accurate: selecting $n$ samples by reducing variance over $\mathcal{S}$ preserves the fast-rate generalization $O(\dim(\mathcal{S})/n)$, independent of the parameter dimension. Empirically, we concretize the variance-bias balance via synthetic experiments and demonstrate the effectiveness of SkMM for finetuning in real vision tasks.
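The two stages described above can be illustrated with a minimal NumPy sketch. Several pieces here are assumptions for illustration only: the toy gradient matrix `G`, the Gaussian sketching matrix, and the helper names are hypothetical, and the greedy second-moment matching is a simplified stand-in for the quadratic-programming relaxation the paper uses, not the authors' implementation.

```python
import numpy as np

def sketch_gradients(G, m, rng):
    """Stage (i): project N x r gradient features to m dims with a Gaussian sketch."""
    r = G.shape[1]
    Gamma = rng.standard_normal((r, m)) / np.sqrt(m)
    return G @ Gamma

def greedy_moment_matching(Z, n):
    """Stage (ii), simplified: greedily pick n rows of Z whose second moment
    tracks the second moment of all rows (stand-in for the paper's QP relaxation)."""
    N = Z.shape[0]
    target = Z.T @ Z / N                      # second moment of the full dataset
    outer = np.einsum("ij,ik->ijk", Z, Z)     # per-sample outer products, N x m x m
    running = np.zeros_like(target)
    available = np.ones(N, dtype=bool)
    selected = []
    for k in range(1, n + 1):
        # Second moment of the subset if sample i were added as the k-th pick.
        cand = (running[None] + outer) / k
        err = np.linalg.norm(cand - target[None], axis=(1, 2))
        err[~available] = np.inf              # never pick a sample twice
        i = int(np.argmin(err))
        selected.append(i)
        available[i] = False
        running += outer[i]
    return np.array(selected)

def moment_gap(Z, idx):
    """Frobenius distance between full and selected second moments."""
    full = Z.T @ Z / Z.shape[0]
    sub = Z[idx].T @ Z[idx] / len(idx)
    return float(np.linalg.norm(full - sub))

rng = np.random.default_rng(0)
N, r, m, n = 500, 200, 8, 40
# Toy gradients with low intrinsic dimension: rank-5 signal plus small noise (assumed data).
G = rng.standard_normal((N, 5)) @ rng.standard_normal((5, r))
G += 0.01 * rng.standard_normal((N, r))

Z = sketch_gradients(G, m, rng)               # N x m sketched gradients
idx = greedy_moment_matching(Z, n)            # indices of the n selected samples
print("moment gap of selection:", moment_gap(Z, idx))
```

The point of the sketch is the cost structure: all matching happens on `m x m` moments rather than `r x r` ones, which is why working in the low-dimensional subspace keeps selection cheap when `r` is large.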

Paper Structure

This paper contains 50 sections, 12 theorems, 79 equations, 2 figures, 6 tables, and 1 algorithm.

Introduction
Low intrinsic dimension leads to variance-bias tradeoff in data selection.
Gradient sketching finds a good low-dimensional subspace fast and provably.
Moment matching in low dimension selects data that control the variance.
Related Works
Coreset selection and low-rank approximations.
Gradient sketching.
Moment matching and optimal experimental design.
(Unsupervised) data selection.
Notations
Data Selection for Finetuning
Finetuning.
Data selection.
Low-dimensional Linear Probing: Variance Minimization
High-dimension Finetuning with Low Intrinsic Dimension: Variance-Bias Tradeoff
...and 35 more sections

Key Result

Proposition 2.1

Assume there exists ... For $S$ sampled uniformly (with replacement) over $\mathcal{D}$, with probability at least $1-\delta$ over $S$, $\boldsymbol{\Sigma}^{\phi} \preccurlyeq c_S \boldsymbol{\Sigma}^{\phi}_{S}$ for any $c_S > 1$ if $n \gtrsim \frac{B_\phi^4}{\gamma^2} \cdot \frac{r + \log\left(1/\delta\right)}{\left(1 \cdots\right)}$ ...
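The condition $\boldsymbol{\Sigma}^{\phi} \preccurlyeq c_S \boldsymbol{\Sigma}^{\phi}_{S}$ can be checked numerically for a given sample. The following is a minimal sketch under stated assumptions: the second moments are taken to be $\Phi^\top \Phi / N$ and $\Phi_S^\top \Phi_S / n$, the features are synthetic Gaussian data, and the helper names `second_moment` and `dominates` are hypothetical.

```python
import numpy as np

def second_moment(Phi):
    """Empirical second moment of the rows of Phi (assumed definition)."""
    return Phi.T @ Phi / Phi.shape[0]

def dominates(Sigma, Sigma_S, c_S, tol=1e-10):
    """Return True if Sigma <= c_S * Sigma_S in the Loewner (PSD) order."""
    gap = c_S * Sigma_S - Sigma
    gap = (gap + gap.T) / 2                   # symmetrize against round-off
    return bool(np.linalg.eigvalsh(gap).min() >= -tol)

rng = np.random.default_rng(0)
N, r, c_S = 2000, 5, 2.0
Phi = rng.standard_normal((N, r))             # synthetic low-dimensional features
Sigma = second_moment(Phi)

for n in (10, 50, 200, 1000):
    S = rng.integers(0, N, size=n)            # uniform sampling with replacement
    print(n, dominates(Sigma, second_moment(Phi[S]), c_S))
```

Running the loop for increasing `n` gives an empirical feel for how large a uniform sample must be before the dominance condition holds for a chosen `c_S`.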

Figures (2)

Figure 1: Controlling variance-bias tradeoff in data selection for high-dimensional finetuning via gradient sketching $+$ moment matching (SkMM). Consider a toy dataset with $N$ samples (in blue) whose finetuning gradients lie in a high-dimensional parameter space $\mathbb{R}^r$ (visualized in 3D) with a low intrinsic dimension (e.g., three clusters). The goal is to select $n = n_1 + n_2 + n_3 < r$ samples for finetuning. (a) Bias reduction focuses on minimizing the low-rank approximation error, resulting in uniform selection across clusters regardless of their variance. (b) Variance reduction places more emphasis on high-variance clusters and could lead to large bias by missing low-variance ones. (c) Gradient sketching efficiently finds a low-dimensional subspace ${\mathcal{S}}$ (where $\dim({\mathcal{S}}) < n$) with small bias. (d) Moment matching in ${\mathcal{S}}$ controls the variance within the low-bias subspace, leading to a variance-bias balance with fast-rate generalization $O(\dim({\mathcal{S}})/n)$.
Figure 2: Selecting $n=80$ data (colored in red) from the GMM dataset. Intuitively, a coreset $\mathcal{D}_S$ with low bias contains at least one sample per cluster, whereas a low-variance $\mathcal{D}_S$ selects more data from clusters with larger variance. We recall from Theorem 2.2 that the variance-bias balance is essential for good generalization.

Theorems & Definitions (31)

Proposition 2.1: Uniform sampling for low-dimensional linear probing (\ref{apx:pf_linear_probing_sampled})
Theorem 2.2: Main result I: variance-bias tradeoff (\ref{apx:pf_linear_probing_high_dim})
Corollary 2.3: Exploitation + exploration (\ref{apx:pf_linear_probing_high_dim})
Remark 3.1: Gradient sketching
Theorem 3.1: Main result II: gradient sketching (formally in \ref{thm:sketchy_rmm})
Remark 3.2: Relaxing $\widetilde{\boldsymbol{\Sigma}}^{\phi} \preccurlyeq c_S \widetilde{\boldsymbol{\Sigma}}^{\phi}_{S}$ to \ref{eq:rmm_high_dim}
Remark 3.3: $c_S$ controls strength of moment matching
Remark 3.4: Computational efficiency of SkMM
Remark A.1: Leverage score sampling
Remark A.2: V-optimal experimental design
...and 21 more
