ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Party LLM Data Valuation

Yanzhou Pan, Huawei Lin, Yide Ran, Jiamin Chen, Xiaodong Yu, Weijie Zhao, Denghui Zhang, Zhaozhuo Xu

TL;DR

The paper addresses scalable third-party data valuation for LLM pretraining by introducing LinFiK, a linearized, first-order measure of a training sample’s future influence, and ALinFiK, a distillation-based approach that rapidly approximates LinFiK using a small model. It proves LinFiK’s numerical stability and predictive power, and integrates ALinFiK into a scalable valuation system that facilitates high-value data selection and transparent data pricing. Extensive experiments on Howdy!Alpaca and WikiText demonstrate that ALinFiK outperforms baselines in efficiency and scalability, achieving strong data-selection performance with dramatically lower memory and compute costs, even on large LLMs. This work paves the way for more efficient, fair data markets by enabling early, data-driven valuation and compensation of data contributors. The practical impact includes faster training convergence, reduced resource usage, and a principled framework for pricing data contributions in large-scale language-model ecosystems.

Abstract

Large Language Models (LLMs) heavily rely on high-quality training data, making data valuation crucial for optimizing model performance, especially when working within a limited budget. In this work, we aim to offer a third-party data valuation approach that benefits both data providers and model developers. We introduce a linearized future influence kernel (LinFiK), which assesses the value of individual data samples in improving LLM performance during training. We further propose ALinFiK, a learning strategy to approximate LinFiK, enabling scalable data valuation. Our comprehensive evaluations demonstrate that this approach surpasses existing baselines in effectiveness and efficiency, demonstrating significant scalability advantages as LLM parameters increase.

Paper Structure

This paper contains 33 sections, 15 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Integration of ALinFiK into a scalable third-party data valuation system for LLMs. The LLM takes the task-specific test data and the sampled training data and produces LinFiK scores based on the LinFiK equation (\ref{eq:linfik}). The ALinFiK algorithm is then used to approximate LinFiK. The system meets the requirements of both model owners and data providers. (a) For model owners: the ALinFiK scores enable the selection of high-value training data that aligns with model training objectives; (b) For data providers: the ALinFiK scores provide transparent, quantitative metrics for fair data compensation.
  • Figure 2: Visualization of Propositions. (a) The cosine similarity of the gradient vector for a given data sample rapidly converges to one, validating Proposition 1. (b) The relative ordering of gradient vector norms remains stable across training, validating Proposition 2.
  • Figure 3: Results for (a) Experiment \ref{exp:predicting_future_data_influence}: ALinFiK’s stability during training (solid line on the left Y-axis, dashed line on the right, log-scaled Y-axis), and (b) Experiment \ref{exp:brittleness_test}: the brittleness test of various baselines.
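
The two diagnostics named in the Figure 2 caption (cosine similarity of a sample's gradient across training, and stability of the ordering of gradient norms) can be illustrated with a minimal sketch. This is not the paper's code: the function names, the use of Spearman rank correlation as the "ordering stability" measure, and the synthetic gradients are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two flattened gradient vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_agreement(norms_a, norms_b):
    """Spearman rank correlation between two sets of per-sample norms.

    Assumes no ties; ranks are obtained with a double argsort.
    """
    ranks_a = np.argsort(np.argsort(norms_a))
    ranks_b = np.argsort(np.argsort(norms_b))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

# Synthetic stand-in (assumption): per-sample gradients at two checkpoints.
# The later gradients keep each sample's direction and shrink every norm
# by the same factor, which is the idealized case the caption describes.
rng = np.random.default_rng(0)
grads_early = rng.normal(size=(8, 16))   # 8 samples, 16 parameters
grads_late = 0.5 * grads_early

# Diagnostic (a): per-sample cosine similarity between checkpoints.
cosines = [cosine_similarity(e, l) for e, l in zip(grads_early, grads_late)]

# Diagnostic (b): does the ordering of gradient norms survive training?
norms_early = np.linalg.norm(grads_early, axis=1)
norms_late = np.linalg.norm(grads_late, axis=1)
ordering_stability = rank_agreement(norms_early, norms_late)
```

In this idealized setup every cosine similarity and the rank agreement equal one up to floating-point error; with real checkpoints, values close to one would be the pattern the caption reports.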