Reframing Data Value for Large Language Models Through the Lens of Plausibility

Mohamad Rida Rammal; Ruida Zhou; Suhas Diggavi

TL;DR

This work proposes an alternative perspective on the data value problem for language models, centering around the plausibility of the data, and posits that data holds lesser value if it can be plausibly generated by the model itself.

Abstract

Data valuation seeks to answer the important question, "How much is this data worth?" Existing data valuation methods have largely focused on discriminative models, primarily examining data value through the lens of its utility in training. However, with the push for ever-larger language models, relying on valuation methods that require training becomes increasingly expensive and dependent on specific techniques. We propose an alternative perspective on the data value problem for language models, centering around the plausibility of the data. We posit that data holds lesser value if it can be plausibly generated by the model itself. Starting from some intuitive criteria that align with our notions of valuable data, we develop a novel value function that is computationally tractable and derived from first principles with provable properties. We conduct a theoretical analysis of our value function and evaluate it across multiple scenarios and datasets.

Paper Structure

This paper contains 38 sections, 11 theorems, 38 equations, 3 figures, 1 table, 1 algorithm.

Introduction
Preliminaries
Setup
Properties of a Desirable Value Function
Overview of the Solution
Alternatives
Value function and its properties
The Components of the Value Function
Comparing Against the Uniform
Independence Testing
The UMI Value Function
Analysis
IID Case
Markov Case
Experiments
...and 23 more sections

Key Result

Theorem 3.1

Let $X=(X_1,X_2,\ldots,X_d)$ be a continuous $d$-dimensional random vector with a joint density function. Let $F_i(\cdot|x_1,\ldots,x_{i-1})$ be the conditional cumulative distribution function corresponding to the conditional density function $f_i(\cdot|x_1,\ldots,x_{i-1})$. The random variables $Z_1, \ldots, Z_d$ given by Rosenblatt's transformation are independent and identically distributed f
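As an illustration of Rosenblatt's transformation in its standard form, $Z_i = F_i(X_i \mid X_1,\ldots,X_{i-1})$, whose outputs are independent standard uniforms, the sketch below applies it to a two-dimensional example. The bivariate normal with correlation `rho`, the sample size, and all function names are assumptions chosen for illustration; none of them come from the paper, and this is not the paper's value function.

```python
import math
import random
from statistics import NormalDist


def sample_pair(rng, rho):
    """Draw (x1, x2) from a standard bivariate normal with correlation rho."""
    x1 = rng.gauss(0.0, 1.0)
    x2 = rho * x1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    return x1, x2


def rosenblatt_bivariate_normal(x1, x2, rho):
    """Apply Rosenblatt's transformation: z1 = F_1(x1), z2 = F_2(x2 | x1)."""
    std_normal = NormalDist()
    z1 = std_normal.cdf(x1)
    # For this model, X2 | X1 = x1 is normal with mean rho * x1
    # and standard deviation sqrt(1 - rho^2).
    cond_std = math.sqrt(1.0 - rho ** 2)
    z2 = std_normal.cdf((x2 - rho * x1) / cond_std)
    return z1, z2


def pearson(us, vs):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    var_u = sum((u - mu) ** 2 for u in us)
    var_v = sum((v - mv) ** 2 for v in vs)
    return cov / math.sqrt(var_u * var_v)


rng = random.Random(0)  # fixed seed so the run is reproducible
rho = 0.8               # illustrative correlation, not from the paper
samples = [sample_pair(rng, rho) for _ in range(20000)]
pairs = [rosenblatt_bivariate_normal(x1, x2, rho) for x1, x2 in samples]

corr_x = pearson([s[0] for s in samples], [s[1] for s in samples])
corr_z = pearson([p[0] for p in pairs], [p[1] for p in pairs])
mean_z1 = sum(p[0] for p in pairs) / len(pairs)
mean_z2 = sum(p[1] for p in pairs) / len(pairs)

print(f"correlation before: {corr_x:.3f}, after: {corr_z:.3f}")
print(f"means after transform: {mean_z1:.3f}, {mean_z2:.3f}")
```

The inputs are strongly correlated, while the transformed coordinates should have sample means near 0.5 and a sample correlation near zero. A near-zero correlation is consistent with independence but does not establish it on its own.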

Figures (3)

Figure 1: A discrete CDF and its interpolated counterpart.
Figure 2: Marginal cumulative distribution of data. For different data, we plot the obtained marginal cumulative distribution function alongside the cumulative distribution function of the standard uniform for (a) data generated by the model, (b) random tokens, (c) random characters, and (d-f) new unseen data.
Figure 3: UMI value function when the prompt is not given.

Theorems & Definitions (18)

Theorem 3.1: Rosenblatt's transformation (Rosenblatt, 1952)
Definition 3.1
Theorem 3.2
proof
Lemma 3.1
proof
Lemma 3.2
proof
Theorem 3.3
proof
...and 8 more
