ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Party LLM Data Valuation
Yanzhou Pan, Huawei Lin, Yide Ran, Jiamin Chen, Xiaodong Yu, Weijie Zhao, Denghui Zhang, Zhaozhuo Xu
TL;DR
The paper addresses scalable third-party data valuation for LLM pretraining. It introduces LinFiK, a linearized, first-order measure of a training sample’s future influence, and ALinFiK, a distillation-based approach that trains a small model to approximate LinFiK quickly. The authors prove LinFiK’s numerical stability and predictive power, and integrate ALinFiK into a valuation system that supports high-value data selection and transparent data pricing. Experiments on Howdy!Alpaca and WikiText show that ALinFiK outperforms baselines in efficiency and scalability, achieving strong data-selection performance at much lower memory and compute cost, even on large LLMs. The practical benefits are faster training convergence, reduced resource usage, and a principled basis for valuing and compensating data contributors early, which supports more efficient and fairer data markets.
Abstract
Large Language Models (LLMs) rely heavily on high-quality training data, making data valuation crucial for optimizing model performance, especially under a limited budget. In this work, we aim to offer a third-party data valuation approach that benefits both data providers and model developers. We introduce a linearized future influence kernel (LinFiK), which assesses the value of individual data samples in improving LLM performance during training. We further propose ALinFiK, a learning strategy to approximate LinFiK, enabling scalable data valuation. Our comprehensive evaluations show that this approach surpasses existing baselines in both effectiveness and efficiency, with significant scalability advantages as the number of LLM parameters increases.
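To make the idea of a first-order influence score concrete, the sketch below uses a generic gradient inner product on a toy logistic-regression model. It is an illustration under stated assumptions, not the paper's definition of LinFiK, which the abstract does not give. The function names, the toy model, and the learning rate are all hypothetical. The score estimates how much one SGD step on a training sample would reduce the validation loss, using a first-order Taylor expansion.

```python
import numpy as np

def grad_logistic(w, x, y):
    """Gradient of the logistic loss for one example (x, y), with y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * x

def loss_logistic(w, X, Y):
    """Mean logistic loss over a dataset, computed stably via logaddexp."""
    z = X @ w
    return float(np.mean(np.logaddexp(0.0, z) - Y * z))

def first_order_influence(w, x_train, y_train, X_val, Y_val, lr):
    """Predicted drop in validation loss after one SGD step on (x_train, y_train).

    Assumption: a generic first-order (linearized) estimate,
    lr * <grad_train, grad_val>, used here only for illustration.
    """
    g_train = grad_logistic(w, x_train, y_train)
    g_val = np.mean([grad_logistic(w, x, y) for x, y in zip(X_val, Y_val)], axis=0)
    return lr * float(g_train @ g_val)

# Toy data (hypothetical): rank training samples by their estimated influence.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
X_train, Y_train = rng.normal(size=(8, 5)), rng.integers(0, 2, size=8)
X_val, Y_val = rng.normal(size=(16, 5)), rng.integers(0, 2, size=16)

scores = np.array([
    first_order_influence(w, x, y, X_val, Y_val, lr=1e-3)
    for x, y in zip(X_train, Y_train)
])
ranking = np.argsort(-scores)  # highest estimated value first
```

A higher score means the sample is predicted to lower the validation loss more, so sorting by score gives a simple data-selection order. For an LLM, computing such gradients directly is expensive, which is the cost that a learned approximation such as ALinFiK is meant to avoid.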
