Influence-Preserving Proxies for Gradient-Based Data Selection in LLM Fine-tuning

Sirui Chen, Yunzhe Qi, Mengting Ai, Yifan Sun, Ruizhong Qiu, Jiaru Zou, Jingrui He

TL;DR

This work tackles the high cost of gradient-based data selection for large language model fine-tuning by introducing IProX, a two-stage framework that constructs influence-preserving proxies directly from the target model. Stage 1 (IPSVD) compresses the weights with an influence-aware low-rank decomposition that reweights components using second-moment statistics. Stage 2 aligns the proxy’s gradients with the target’s in the low-rank space and anchors the proxy’s outputs with a KL loss, so that the target’s influence signals are preserved. Across multiple model families and tasks, IProX consistently outperforms off-the-shelf proxies and baselines, sometimes even surpassing the target model itself in data selection quality, while delivering substantial efficiency gains (e.g., more than a 50% reduction in computational cost relative to the full 3B model on Llama3.2). The result is a scalable, practical route to gradient-based data selection for large-scale LLM fine-tuning that uses compute more efficiently without sacrificing downstream performance.
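To make the Stage 2 objective concrete, here is a minimal sketch of what a combined "gradient alignment plus KL anchoring" loss could look like. The summary does not give the paper's exact formulation, so everything here is an assumption for illustration: the cosine-distance form of the gradient term, the direction of the KL term (target to proxy), the hypothetical `kl_weight` coefficient, and the premise that both gradient vectors have already been expressed in the same low-rank space.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target_logits, proxy_logits):
    """KL(p_target || p_proxy) for one output distribution (assumed direction)."""
    p = softmax(target_logits)
    q = softmax(proxy_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

def cosine_distance(u, v):
    """1 - cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def alignment_loss(target_grad, proxy_grad, target_logits, proxy_logits, kl_weight=0.1):
    """Hypothetical Stage 2 loss: gradient alignment term plus KL anchor on the logits."""
    grad_term = cosine_distance(target_grad, proxy_grad)
    anchor_term = kl_divergence(target_logits, proxy_logits)
    return grad_term + kl_weight * anchor_term

# Toy values: the proxy gradient points roughly the same way as the target's.
loss = alignment_loss(
    target_grad=[1.0, 0.0, 2.0],
    proxy_grad=[0.9, 0.1, 1.8],
    target_logits=[2.0, 0.5, -1.0],
    proxy_logits=[1.8, 0.7, -0.9],
)
```

The gradient term pushes the proxy to rank training samples the way the target would, while the KL term keeps the proxy's predictions from drifting away from the target's during alignment.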

Abstract

Supervised fine-tuning (SFT) relies critically on selecting training data that most benefits a model's downstream performance. Gradient-based data selection methods such as TracIn and Influence Functions leverage influence to identify useful samples, but their computational cost scales poorly, making them impractical for multi-billion-parameter large language models (LLMs). A common alternative is to use off-the-shelf smaller models as proxies, but they remain suboptimal because their learning dynamics are unclear, their sizes cannot be flexibly adjusted, and they cannot be further aligned with the target model in terms of gradient-based influence estimation. To address these challenges, we introduce IProX, a two-stage framework that derives influence-preserving proxies directly from the target model. It first applies a low-rank compression stage that preserves the influence information of the target model, and then an alignment stage that aligns both model gradients and logits, thereby constructing proxies that flexibly control computational cost while retaining the target model's influence. Experimental results across diverse LLM families and evaluation tasks show that IProX consistently outperforms off-the-shelf proxies and baseline methods. On Qwen3-4B, a 1.5B proxy constructed with IProX achieves stronger performance than the larger 1.7B off-the-shelf proxy. Notably, on Llama3.2, IProX achieves better performance than baselines while reducing computational cost by more than half relative to the full 3B model. These results show that IProX provides effective influence-preserving proxies, making gradient-based data selection more scalable for LLMs.
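For readers unfamiliar with gradient-based selection, the following toy sketch shows a TracIn-style score: the influence of a training example on a validation example is approximated by summing, over saved checkpoints, the learning rate times the dot product of their loss gradients. The linear model, squared loss, checkpoints, and data below are illustrative assumptions and are not taken from the paper. For an LLM, each gradient is as large as the model itself, which is the cost that a smaller proxy is meant to reduce.

```python
def grad_squared_loss(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]

def tracin_score(checkpoints, learning_rates, train_example, val_example):
    """Sum over checkpoints of lr * <grad(train), grad(val)>."""
    score = 0.0
    for w, lr in zip(checkpoints, learning_rates):
        g_train = grad_squared_loss(w, *train_example)
        g_val = grad_squared_loss(w, *val_example)
        score += lr * sum(a * b for a, b in zip(g_train, g_val))
    return score

def select_top_k(checkpoints, learning_rates, train_set, val_set, k):
    """Rank training examples by total influence on the validation set."""
    totals = []
    for idx, example in enumerate(train_set):
        total = sum(
            tracin_score(checkpoints, learning_rates, example, v) for v in val_set
        )
        totals.append((total, idx))
    totals.sort(reverse=True)
    return [idx for _, idx in totals[:k]]

# Toy setup: two saved checkpoints of a 2-parameter linear model.
checkpoints = [[0.0, 0.0], [0.5, 0.1]]
learning_rates = [0.1, 0.1]
train_set = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 2.0)]
val_set = [([1.0, 0.0], 1.0)]

selected = select_top_k(checkpoints, learning_rates, train_set, val_set, k=2)
print(selected)  # indices of the two most helpful training examples
```

In this toy case the second training example has a gradient orthogonal to the validation gradient at both checkpoints, so its score is zero and it is not selected.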

Paper Structure (40 sections, 3 theorems, 47 equations, 4 figures, 15 tables)

This paper contains 40 sections, 3 theorems, 47 equations, 4 figures, 15 tables.

Key Result

Proposition 4.1

Consider a perturbation to layer $\ell$: $W_\ell \mapsto \widehat{W}_\ell = W_\ell + E_\ell$. Under assumptions of local smoothness, geometric coherence, and a bounded covariate shift condition between the distributions of $z$ and $z'$ (see the appendix proof of the TracIn influence bound for d

Figures (4)

  • Figure 1: For Qwen3-4B, a 1.5B IProX proxy outperforms the Qwen3-1.7B off-the-shelf proxy, demonstrating that a smaller influence-preserving proxy can achieve better data selection performance.
  • Figure 2: Overview of IProX. In the first stage (left), IPSVD leverages hidden states and gradients to build second-moment matrices that reweight the model weights for proxy initialization. In the second stage (right), the proxy is further aligned with the target LLM through internal gradient alignment in the low-rank space and external logits anchoring for stability.
  • Figure 3: Loss and influence (TracIn) retention of plain SVD and our IPSVD at different compression sparsity levels.
  • Figure 4: TFLOPs breakdown on Llama3.2-3B across different sparsity levels.
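
The Figure 2 caption describes reweighting a layer's weights with second-moment matrices built from hidden states and gradients before compressing. As a rough illustration of that general idea, the sketch below applies a generic two-sided weighted SVD to one linear layer. This is an assumption-laden stand-in, not the paper's IPSVD: the whitening construction, the function names, and the random data are all hypothetical. It relies on the fact that for a layer $y = Wx$ the weight gradient is $g x^\top$, so errors in $W$ matter most along directions where the inputs $x$ and output gradients $g$ carry energy.

```python
import numpy as np

def sqrt_and_inv_sqrt(S, eps=1e-6):
    """Symmetric square root and inverse square root of a PSD matrix."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, eps, None)
    root = (vecs * np.sqrt(vals)) @ vecs.T
    inv_root = (vecs / np.sqrt(vals)) @ vecs.T
    return root, inv_root

def weighted_low_rank(W, X, G, rank):
    """Factor W (d_out, d_in) as A @ B using input/gradient second moments.

    X: (n, d_in) hidden states, G: (n, d_out) gradients w.r.t. layer outputs.
    """
    n = X.shape[0]
    Sx_root, Sx_inv = sqrt_and_inv_sqrt(X.T @ X / n)  # input second moment
    Sg_root, Sg_inv = sqrt_and_inv_sqrt(G.T @ G / n)  # gradient second moment
    # Truncate the SVD in the reweighted space, then map the factors back.
    U, s, Vt = np.linalg.svd(Sg_root @ W @ Sx_root, full_matrices=False)
    A = Sg_inv @ (U[:, :rank] * s[:rank])  # (d_out, rank)
    B = Vt[:rank] @ Sx_inv                 # (rank, d_in)
    return A, B, (Sg_root, Sx_root)

rng = np.random.default_rng(0)
d_out, d_in, n, rank = 8, 6, 64, 2
W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(n, d_in)) * np.linspace(0.1, 3.0, d_in)   # anisotropic inputs
G = rng.normal(size=(n, d_out)) * np.linspace(3.0, 0.1, d_out)  # anisotropic grads

A, B, (Sg_root, Sx_root) = weighted_low_rank(W, X, G, rank)

# Plain truncated SVD at the same rank, for comparison.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_plain = (U[:, :rank] * s[:rank]) @ Vt[:rank]

def weighted_error(W_hat):
    """Frobenius error measured in the second-moment-weighted space."""
    return np.linalg.norm(Sg_root @ (W - W_hat) @ Sx_root)

err_weighted = weighted_error(A @ B)
err_plain = weighted_error(W_plain)
dense_params = d_out * d_in
factored_params = rank * (d_out + d_in)
print(err_weighted, err_plain, dense_params, factored_params)
```

By the Eckart-Young theorem applied in the reweighted space, the weighted factorization can never have a larger weighted error than plain SVD at the same rank. The factored layer stores `rank * (d_out + d_in)` parameters instead of `d_out * d_in`, which is where the compute savings of a low-rank proxy come from.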

Theorems & Definitions (5)

  • Proposition 4.1
  • Proposition D.1
  • proof
  • Proposition E.1
  • proof