Fair Document Valuation in LLM Summaries via Shapley Values

Zikun Ye; Hema Yoganarasimhan

TL;DR

This work formalizes a Shapley-value framework to fairly credit individual source documents used in LLM-generated summaries, addressing attribution and revenue-sharing challenges. To scale to real platforms, it introduces Cluster Shapley, a structure-aware approximation that uses embeddings to group semantically similar documents into clusters of tunable diameter ε, with theoretical bounds showing that the approximation error shrinks as ε → 0. Empirically, on Amazon product reviews, Cluster Shapley outperforms standard Shapley approximations and simple attribution rules, offering a favorable efficiency-accuracy trade-off and robust performance across LLMs and evaluation setups. The findings highlight the value of leveraging embedding-based structure in attribution and provide a scalable pathway for fair content monetization in AI-powered search and summarization systems.
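The clustering idea above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the greedy clustering rule, the equal split of a cluster's value among its members, the function names (`exact_shapley`, `greedy_cluster`, `cluster_shapley`), and the toy value function, which stands in for an LLM-based summary score.

```python
from itertools import combinations
from math import dist, factorial


def exact_shapley(players, value_fn):
    """Exact Shapley values by enumerating every coalition (O(2^n) value calls)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for a coalition of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (value_fn(s | {p}) - value_fn(s))
        phi[p] = total
    return phi


def greedy_cluster(embeddings, eps):
    """Hypothetical rule: a document joins the first cluster whose members
    are all within distance eps of it, so each cluster has diameter <= eps."""
    clusters = []
    for i, e in enumerate(embeddings):
        for c in clusters:
            if all(dist(e, embeddings[j]) <= eps for j in c):
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def cluster_shapley(embeddings, doc_value_fn, eps):
    """Shapley over clusters, then an equal split within each cluster
    (both steps are assumptions of this sketch)."""
    clusters = greedy_cluster(embeddings, eps)

    def cluster_value(coalition):
        # A coalition of clusters is valued as the union of its documents.
        docs = frozenset(d for c in coalition for d in clusters[c])
        return doc_value_fn(docs)

    phi_cluster = exact_shapley(list(range(len(clusters))), cluster_value)
    return {d: phi_cluster[c] / len(members)
            for c, members in enumerate(clusters) for d in members}


# Toy data: documents 0 and 1 are near-duplicates, document 2 is distinct.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
topics = ["battery", "battery", "screen"]


def toy_value(docs):
    # Stand-in for summary quality: number of distinct topics covered.
    return float(len({topics[d] for d in docs}))


approx = cluster_shapley(embeddings, toy_value, eps=0.5)
exact = exact_shapley([0, 1, 2], toy_value)
print(approx, exact)
```

In this toy case the two near-duplicate documents fall into one cluster, so the sketch evaluates a two-player game instead of a three-player one, and its output agrees with the exact values up to floating-point rounding. With a smaller `eps` every document becomes its own cluster and the sketch reduces to exact Shapley computation.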

Abstract

Large Language Models (LLMs) are increasingly used in systems that retrieve and summarize content from multiple sources, such as search engines and AI assistants. While these systems enhance user experience through coherent summaries, they obscure the individual contributions of original content creators, raising concerns about credit attribution and compensation. We address the challenge of valuing individual documents used in LLM-generated summaries by proposing a Shapley value-based framework for fair document valuation. Although theoretically appealing, exact Shapley value computation is prohibitively expensive at scale. To improve efficiency, we develop Cluster Shapley, a simple approximation algorithm that leverages semantic similarity among documents to reduce computation while maintaining attribution accuracy. Using Amazon product review data, we empirically show that off-the-shelf Shapley approximations, such as Monte Carlo sampling and Kernel SHAP, perform suboptimally in LLM settings, whereas Cluster Shapley substantially improves the efficiency-accuracy frontier. Moreover, simple attribution rules (e.g., equal or relevance-based allocation), though computationally cheap, lead to highly unfair outcomes. Together, our findings highlight the potential of structure-aware Shapley approximations tailored to LLM summarization and offer guidance for platforms seeking scalable and fair content attribution mechanisms.
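For reference, the textbook Shapley value of document i can be written as follows. The notation is an assumption of this note rather than the paper's own: N is the set of n retrieved documents, and v(S) is some score of the summary generated from the subset S.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,(n - |S| - 1)!}{n!}\,
  \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr]
```

The sum ranges over all 2^{n-1} subsets that exclude document i. If each evaluation of v involves generating and scoring a summary, the number of LLM calls grows exponentially in n, which is consistent with the abstract's point that exact computation is prohibitively expensive at scale.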
