GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry

Guanghui Min, Tianhao Huang, Ke Wan, Chen Chen

TL;DR

This work proposes GIST (Gradient Isometric Subspace Transformation), a simple yet principled alternative that replaces axis-aligned scaling with robust subspace alignment and matches or outperforms the state-of-the-art baseline with only 0.29% of the storage and 25% of the computational time under the same selection budget.

Abstract

Targeted data selection has emerged as a crucial paradigm for efficient instruction tuning, aiming to identify a small yet influential subset of training examples for a specific target task. In practice, influence is often measured through the effect of an example on parameter updates. To make selection scalable, many approaches leverage optimizer statistics (e.g., Adam states) as an axis-aligned surrogate for update geometry (i.e., diagonal preconditioning), implicitly treating parameters as coordinate-wise independent. We show that this assumption breaks down in parameter-efficient fine-tuning (PEFT) methods such as LoRA. In this setting, the induced optimization geometry exhibits strong cross-parameter coupling with non-trivial off-diagonal interactions, while the task-relevant update directions are confined to a low-dimensional subspace. Motivated by this mismatch, we propose GIST (Gradient Isometric Subspace Transformation), a simple yet principled alternative that replaces axis-aligned scaling with robust subspace alignment. GIST recovers a task-specific subspace from validation gradients via spectral filtering (SVD), projects training gradients into this coupled subspace, and scores examples by their alignment with target directions. Extensive experiments show that GIST matches or outperforms the state-of-the-art baseline with only 0.29% of the storage and 25% of the computational time under the same selection budget.
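
The three operations named in the abstract (SVD on validation gradients, projection of training gradients, alignment scoring) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of the mean validation gradient as the target direction, and cosine similarity as the alignment score are all assumptions made for illustration, and the gradients are tiny hand-written vectors rather than LoRA gradients.

```python
import numpy as np

def target_subspace(val_grads, rank):
    """Return the top-`rank` right singular vectors of the validation
    gradient matrix (shape: n_val x d) as an orthonormal basis (rank x d)."""
    _, _, vt = np.linalg.svd(val_grads, full_matrices=False)
    return vt[:rank]

def alignment_scores(train_grads, val_grads, basis):
    """Cosine similarity, inside the subspace, between each training gradient
    and the mean validation gradient (assumed target direction)."""
    z_train = train_grads @ basis.T               # (n_train, rank)
    z_val = val_grads.mean(axis=0) @ basis.T      # (rank,)
    z_train = z_train / (np.linalg.norm(z_train, axis=1, keepdims=True) + 1e-12)
    z_val = z_val / (np.linalg.norm(z_val) + 1e-12)
    return z_train @ z_val

def select_top_k(scores, k):
    """Indices of the k highest-scoring training examples."""
    return np.argsort(-scores)[:k]

# Hypothetical validation gradients that live in the span of the first two axes.
val_grads = np.array([[2.0, 0.1, 0.0, 0.0],
                      [1.5, -0.1, 0.0, 0.0],
                      [1.0, 0.0, 0.0, 0.0]])

# Hypothetical training gradients: aligned, off-subspace, opposed, nearly aligned, off-subspace.
train_grads = np.array([[3.0, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0],
                        [-2.0, 0.0, 0.0, 0.0],
                        [1.0, 0.05, 0.0, 0.0],
                        [0.0, 0.0, 0.0, 4.0]])

basis = target_subspace(val_grads, rank=2)
scores = alignment_scores(train_grads, val_grads, basis)
selected = select_top_k(scores, k=2)
print(sorted(selected.tolist()))
```

In this toy setup the two examples whose gradients point along the dominant validation direction are selected, the opposed example receives a negative score, and examples lying outside the recovered subspace score near zero.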

Paper Structure (46 sections, 5 theorems, 56 equations, 8 figures, 9 tables)

This paper contains 46 sections, 5 theorems, 56 equations, 8 figures, 9 tables.

Key Result

Theorem 3.1

Fix the current parameters $\boldsymbol{\theta}_t$. Under the first-order approximation, the selection problem (def:problem) approximately reduces to maximizing the predicted reduction in validation loss, up to constants independent of $S$ and higher-order terms.
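
To make the phrase "predicted reduction in validation loss" concrete, the sketch below checks a generic first-order Taylor estimate numerically. It does not reproduce the theorem's objective, which is not shown here. It assumes a plain gradient step on a candidate subset, a hypothetical quadratic validation loss, and hand-picked training gradients; all names are illustrative.

```python
import numpy as np

def val_loss(theta, A, b):
    """Hypothetical quadratic validation loss 0.5 * ||A theta - b||^2."""
    r = A @ theta - b
    return 0.5 * float(r @ r)

def val_grad(theta, A, b):
    """Gradient of the quadratic validation loss."""
    return A.T @ (A @ theta - b)

def predicted_reduction(theta, A, b, train_grads, subset, lr):
    """First-order estimate of the validation loss drop after one plain
    gradient step on the summed gradients of `subset` (an assumption)."""
    step = lr * train_grads[subset].sum(axis=0)
    return float(val_grad(theta, A, b) @ step)

def actual_reduction(theta, A, b, train_grads, subset, lr):
    """Exact validation loss drop for the same step."""
    step = lr * train_grads[subset].sum(axis=0)
    return val_loss(theta, A, b) - val_loss(theta - step, A, b)

A = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([1.0, 1.0])
theta = np.zeros(2)
train_grads = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])

helpful = [0, 1]   # gradients aligned with the validation gradient
harmful = [2]      # gradient opposed to it
pred_helpful = predicted_reduction(theta, A, b, train_grads, helpful, lr=0.01)
pred_harmful = predicted_reduction(theta, A, b, train_grads, harmful, lr=0.01)
true_helpful = actual_reduction(theta, A, b, train_grads, helpful, lr=0.01)
print(pred_helpful, true_helpful, pred_harmful)
```

For a quadratic loss the gap between the estimate and the exact drop is exactly the second-order term, so it shrinks quadratically with the step size. This is the sense in which higher-order terms are dropped.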

Figures (8)

  • Figure 1: Spectral analysis of the MMLU validation gradient $\mathbf{G}_{\text{val}}$ on Llama2-7B. We decompose the gradient matrix via SVD to obtain singular values $\sigma_i$. (a) Cumulative explained variance. A steeper curve indicates that a smaller principal subspace dimension is sufficient to capture the majority of the variance (e.g., Rank $150$ captures $95\%$), confirming high directional information density. (b) The singular values ($\sigma_k$) exhibit precipitous decay, further verifying the intrinsic low-rank structure.
  • Figure 2: Overview of GIST. Step 1: Lightweight warmup performs a short LoRA warmup on a sampled subset and computes validation gradients. Step 2: Spectral filtering applies an SVD on the validation gradient matrix to construct a low-rank target subspace (Target projector). Step 3: Geometric scoring projects candidate gradients onto the target subspace and selects Top-$k$ samples.
  • Figure 3: Impact of Checkpoint Selection. (a) Using single-epoch gradients shows a clear performance drop in later epochs. (b) Aggregating multiple checkpoints (weighted) does not outperform the early-stop strategy, confirming that early gradients contain the essential task optimization directions.
  • Figure 4: Accuracy as a function of the projection rank used by GIST, compared with LESS at the same selection budget.
  • Figure 5: Toy 2D optimization dynamics with the same initialization $\boldsymbol{\theta}_0=(-2.5,0)$. Newton (full-matrix) follows the direct descent direction, while Adam (diagonal) cannot express the rotation induced by coupling, leading to a "zig-zag" trajectory on the coupled landscape.
  • ...and 3 more figures
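
The cumulative explained variance curve described in Figure 1 can be computed from the singular values alone. The sketch below is an illustration under stated assumptions, not the paper's code: it measures variance by squared singular values without mean-centering, the helper name `rank_for_variance` is hypothetical, and the input is a small synthetic matrix rather than an MMLU gradient matrix.

```python
import numpy as np

def rank_for_variance(grad_matrix, threshold=0.95):
    """Smallest rank whose leading singular values capture at least
    `threshold` of the total squared singular value mass.
    Returns (rank, cumulative_energy)."""
    s = np.linalg.svd(grad_matrix, compute_uv=False)   # sorted descending
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First index where the cumulative energy reaches the threshold.
    idx = int(np.searchsorted(energy, threshold))
    rank = min(idx + 1, len(s))                        # guard against round-off
    return rank, energy

# Synthetic matrix with rapidly decaying singular values 4, 2, 1, 0.5.
G = np.diag([4.0, 2.0, 1.0, 0.5])
rank, energy = rank_for_variance(G, threshold=0.95)
print(rank, np.round(energy, 4))
```

Here the first two singular values capture about 94% of the squared mass, so a third direction is needed to cross the 95% threshold.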

Theorems & Definitions (9)

  • Theorem 3.1: Single-level Approximate Optimization
  • Theorem 3.2: LoRA induces off-diagonal curvature
  • Theorem 3.3: Eigenspace stability of the proxy
  • proof
  • proof
  • Lemma 5.2: Gauss-Newton/Fisher decomposition for NLL (papyan2020traces)
  • Lemma 5.3: Davis-Kahan Theorem (davis1970rotation)
  • proof
  • Remark 5.4: Theoretical Necessity of Warmup