Sell Data to AI Algorithms Without Revealing It: Secure Data Valuation and Sharing via Homomorphic Encryption

Michael Yang, Ruijiang Gao, Zhiqiang Zheng

TL;DR

The paper tackles Arrow’s Information Paradox in AI data markets by introducing the Trustworthy Influence Protocol (TIP), a privacy-preserving framework that enables buyers to quantify external data utility without decrypting the raw data. It combines CKKS-based homomorphic encryption with gradient-based influence functions, using low-rank gradient projections to scale to large models like LLMs. Empirical results across MNIST, BERT, and GPT-2 show encrypted valuations closely track plaintext utility with modest overhead, and healthcare/book-market simulations reveal high predictive fidelity and pronounced data-utility heterogeneity, supporting meritocratic pricing over flat-rate models. Collectively, TIP provides a scalable, cryptographically sound foundation for a private, performance-driven data economy in AI development and deployment.

Abstract

The rapid expansion of Artificial Intelligence is hindered by a fundamental friction in data markets: the value-privacy dilemma, where buyers cannot verify a dataset's utility without inspection, yet inspection may expose the data (Arrow's Information Paradox). We resolve this challenge by introducing the Trustworthy Influence Protocol (TIP), a privacy-preserving framework that enables prospective buyers to quantify the utility of external data without ever decrypting the raw assets. By integrating Homomorphic Encryption with gradient-based influence functions, our approach allows for the precise, blinded scoring of data points against a buyer's specific AI model. To ensure scalability for Large Language Models (LLMs), we employ low-rank gradient projections that reduce computational overhead while maintaining near-perfect fidelity to plaintext baselines, as demonstrated across BERT and GPT-2 architectures. Empirical simulations in healthcare and generative AI domains validate the framework's economic potential: we show that encrypted valuation signals achieve a high correlation with realized clinical utility and reveal a heavy-tailed distribution of data value in pre-training corpora where a minority of texts drive capability while the majority degrades it. These findings challenge prevailing flat-rate compensation models and offer a scalable technical foundation for a meritocratic, secure data economy.
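The scoring idea in the abstract can be illustrated in plaintext. The sketch below is not the authors' implementation: it assumes the buyer has already computed an inverse-Hessian-vector product for an evaluation point (`ihvp_eval`), that the seller holds a per-example gradient (`grad_candidate`), and that the low-rank step is a shared Gaussian random projection `P`. All of these names, and the choice of projection, are hypothetical. In TIP the inner product would be evaluated under CKKS encryption; here it is computed in the clear to show only the arithmetic.

```python
import random


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [dot(row, v) for row in matrix]


def random_projection(k, d, seed=0):
    """Hypothetical k x d Gaussian sketch with entries drawn from N(0, 1/k).

    With this scaling, dot(P u, P g) equals dot(u, g) in expectation.
    """
    rng = random.Random(seed)
    scale = k ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def influence_score(ihvp_eval, grad_candidate):
    """Exact score: -(H^{-1} grad_eval) . grad_candidate."""
    return -dot(ihvp_eval, grad_candidate)


def projected_influence_score(P, ihvp_eval, grad_candidate):
    """Same score computed on k-dimensional projections of both vectors."""
    return -dot(matvec(P, ihvp_eval), matvec(P, grad_candidate))


if __name__ == "__main__":
    d, k = 500, 64  # illustrative sizes, not taken from the paper
    rng = random.Random(1)
    ihvp_eval = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # Correlate the candidate gradient with ihvp_eval so the score is not near zero.
    grad_candidate = [0.5 * x + rng.gauss(0.0, 1.0) for x in ihvp_eval]
    P = random_projection(k, d)
    print("exact    :", influence_score(ihvp_eval, grad_candidate))
    print("projected:", projected_influence_score(P, ihvp_eval, grad_candidate))
```

Under these assumptions the projected score is an unbiased estimate of the exact one, and only `k` numbers per vector would need to be encrypted instead of `d`, which is the kind of saving the low-rank step is meant to provide.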

Paper Structure

This paper contains 34 sections, 2 theorems, 15 equations, 2 figures, 3 tables, 1 algorithm.

Key Result

Lemma 3.1

Let $H_{\hat{\theta}} = \nabla^2 R_n(\hat{\theta})$ be the Hessian of the training loss at the optimum. The influence of an exogenous candidate point $z_s$ on the loss of a specific evaluation point $z_{\mathrm{eval}} \in D_{\mathrm{eval}}$ is given by:
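For orientation, the standard first-order influence-function expression from the influence-function literature has the form below. This is an assumption about the shape of the lemma's conclusion, not a quotation from the paper: the per-example loss symbol $\ell$ is introduced here for illustration, and the paper's own notation, scaling, and sign convention may differ.

$$
\mathcal{I}(z_s, z_{\mathrm{eval}}) \;=\; -\,\nabla_\theta \ell(z_{\mathrm{eval}}, \hat{\theta})^{\top}\, H_{\hat{\theta}}^{-1}\, \nabla_\theta \ell(z_s, \hat{\theta})
$$

Read this way, the score is an inner product between a buyer-side vector, $H_{\hat{\theta}}^{-1} \nabla_\theta \ell(z_{\mathrm{eval}}, \hat{\theta})$, and the gradient of the candidate point. An inner product needs only additions and multiplications, which are the operations an approximate scheme such as CKKS supports on encrypted vectors.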

Figures (2)

  • Figure 1: Proposed Secure Data Marketplace with Homomorphic Encryption.
  • Figure 2: Distribution of Secure Influence Scores in the Book Marketplace. The distinct skew reveals that a minority of texts drive positive outcomes, while the majority contribute negligible or negative utility, illustrating the inefficiency of uniform pricing.

Theorems & Definitions (2)

  • Lemma 3.1: Influence of a Candidate Point
  • Lemma 3.2: Approximate homomorphism of CKKS