Model-diff: A Tool for Comparative Study of Language Models in the Input Space

Weitang Liu, Yuelei Li, Ying Wai Li, Zihan Wang, Jingbo Shang

TL;DR

The paper tackles the problem of comparing two large language models beyond fixed benchmark datasets by analyzing their predictions over a broad, human-understandable input space defined via low negative log-likelihood (NLL) inputs. It introduces Model-diff, a sampling-based framework that constructs and compares the output-distribution of prediction differences $\big(\rho_{A\rightarrow B}(\mathcal{D}), \rho_{B\rightarrow A}(\mathcal{D})\big)$ with $\mathcal{D} = \text{NLL}_A - \text{NLL}_B$, employing two-stage sampling and normalization over the input-overlap $|\mathbb{X}_{A\cap B}|$. The approach is validated on toy data (where enumeration is feasible) and real-world models (GPT2, Llama families), and is shown to reveal when disagreements concentrate at large $|\mathcal{D}|$ and how input types drive those disagreements. Practical applications demonstrated include model plagiarism detection and assessing which model better aligns with human annotations, offering a principled, scalable way to audit and compare language models without biased input design.
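The quantities named above can be written out as follows. This is a sketch assembled from the notation in the Figure 1 caption (an output range $\mathbb Z=[z_-,z_+]$ and per-input predictions $z_{A,\mathbf{x}}$, $z_{B,\mathbf{x}}$). It states the exact counts that the method aims to estimate; the paper's sampling-based, normalized estimator may be defined differently.

$$
\mathbb{X}_A = \{\mathbf{x} : z_{A,\mathbf{x}} \in \mathbb{Z}\}, \qquad
\mathbb{X}_B = \{\mathbf{x} : z_{B,\mathbf{x}} \in \mathbb{Z}\}, \qquad
\mathcal{D}(\mathbf{x}) = z_{A,\mathbf{x}} - z_{B,\mathbf{x}}
$$

$$
\rho_{A\rightarrow B}(\mathcal{D}) = \big|\{\mathbf{x} \in \mathbb{X}_A : z_{A,\mathbf{x}} - z_{B,\mathbf{x}} = \mathcal{D}\}\big|, \qquad
\rho_{B\rightarrow A}(\mathcal{D}) = \big|\{\mathbf{x} \in \mathbb{X}_B : z_{A,\mathbf{x}} - z_{B,\mathbf{x}} = \mathcal{D}\}\big|
$$

Both histograms share the same sign convention for $\mathcal{D}$; only the set of inputs being counted changes.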

Abstract

Comparing two (large) language models (LMs) side-by-side and pinpointing their prediction similarities and differences on the same set of inputs is crucial in many real-world scenarios, e.g., one can test whether a licensed model was potentially plagiarized by another. Traditional analysis compares the LMs' outputs on benchmark datasets, which cover only a limited number of inputs, designed from particular perspectives for the intended applications. Benchmark datasets cannot cover test cases from unforeseen perspectives, which would help us understand the differences between models in an unbiased way. In this paper, we propose a new comparative analysis setting for models that considers a large input space where brute-force enumeration would be infeasible. The input space can be simply defined as all token sequences on which an LM would produce low perplexity -- we follow this definition in the paper as it would produce the most human-understandable inputs. We propose a novel framework, Model-diff, that uses text generation by sampling and deweights the histogram of sampling statistics to estimate prediction differences between two LMs in this input space efficiently and without bias. Our method achieves this by drawing and counting the inputs at each prediction difference value in negative log-likelihood. Experiments reveal for the first time the quantitative prediction differences between LMs in a large input space, potentially facilitating model analysis for applications such as model plagiarism detection.
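To make "counting the inputs at each prediction difference value" concrete, here is a minimal Python sketch on a toy input space small enough to enumerate exactly. It is not the paper's sampling procedure: the two unigram "models", the vocabulary, the sequence length, the output range `Z`, and the bin width are all hypothetical choices made for illustration, and the counts are obtained by brute force rather than by sampling and deweighting.

```python
import itertools
import math
from collections import Counter

# Hypothetical toy setup: 3-token vocabulary, fixed-length sequences.
VOCAB = ("a", "b", "c")
SEQ_LEN = 4

# Hypothetical unigram "models" (token -> probability); stand-ins for real LMs.
MODEL_A = {"a": 0.6, "b": 0.3, "c": 0.1}
MODEL_B = {"a": 0.2, "b": 0.3, "c": 0.5}


def nll(model, seq):
    """Total negative log-likelihood of seq under an i.i.d. unigram model."""
    return -sum(math.log(model[tok]) for tok in seq)


def diff_histogram(select_by, z_range, bin_width=0.5):
    """Histogram of D = NLL_A - NLL_B over inputs whose NLL under the
    selecting model ("A" or "B") falls inside z_range."""
    z_lo, z_hi = z_range
    counts = Counter()
    for seq in itertools.product(VOCAB, repeat=SEQ_LEN):
        z_a = nll(MODEL_A, seq)
        z_b = nll(MODEL_B, seq)
        z_sel = z_a if select_by == "A" else z_b
        if z_lo <= z_sel <= z_hi:
            d = z_a - z_b  # same sign convention for both histograms
            counts[math.floor(d / bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


Z = (2.0, 5.0)  # hypothetical "low NLL" output range
rho_a_to_b = diff_histogram("A", Z)  # inputs that model A places in Z
rho_b_to_a = diff_histogram("B", Z)  # inputs that model B places in Z

for name, hist in (("A->B", rho_a_to_b), ("B->A", rho_b_to_a)):
    print(name, hist)
```

In this toy setting, mass at very negative bins of `rho_a_to_b` corresponds to inputs that model A finds likely but model B does not, which mirrors the "i=i+1" example in Figure 1. For real LMs the enumeration loop is infeasible, which is the gap the sampling-based framework is meant to fill.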

Paper Structure

This paper contains 27 sections, 13 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of Model-diff with hypothetical models, code-LM (model $A$) and math-LM (model $B$). (a) code-LM assigns "i=i+1" (circled in orange) a small output value ($z$ = NLL), whereas math-LM assigns it a large NLL. (b) The set of inputs $\mathbb X_A$ ($\mathbb X_B$) that model $A$ ($B$) maps to a predefined output range $\mathbb Z=[z_-,z_+]$. The output distributions $\rho_A(z)$ and $\rho_B(z)$ are the counts of inputs at each $z$. (c) Using each model's predictions $z_{A,\mathbf{x}}$ and $z_{B,\mathbf{x}}$ as two coordinate axes shows how the models' predictions relate for the same input (e.g. $z_{A,\mathbf{x}} < z_{B,\mathbf{x}}$ for "i=i+1"). The number of inputs in R3 is used to normalize the sampling statistics (see the section on normalization). (d) Feed the inputs in $\mathbb X_A$ to both models, compute ${\mathcal{D}}=z_{A,\mathbf{x}}-z_{B,\mathbf{x}}$ for each, and count the inputs at each value to get $\rho_{A\rightarrow B}({\mathcal{D}})$ (red histogram). Repeat with the inputs in $\mathbb X_B$ to get $\rho_{B \rightarrow A}({\mathcal{D}})$ (light blue histogram). The input "i=i+1" is mapped to a very negative ${\mathcal{D}}$ value.
  • Figure 2: Comparing different language models using Model-diff on different input spaces. Except for (a), all comparisons are made in the input space that a model regards as reasonable human input, as defined by $\mathbb Z$.
  • Figure 3: GPT2-small-25 vs. a copy of itself with zero-mean noise added to its weights, at different standard deviations (i.e., 0.001 and 0.00001). $\mathbb Z=(2.0,4.0)$.
  • Figure 4: Some representative inputs from Llama2-25 (first 3 rows), GPT2-small-100 (middle 3 rows), and GPT2-medium-25 (last 3 rows). Each row begins with the NLL.
  • Figure 5: Some representative inputs for different ${\mathcal{D}}$ values (first column), taken from GPT2-small-25 (indicated by "0" in the second column) or GPT2-medium-25 (indicated by "1" in the second column). The decoded input sentence(s) follow in the third column. Each group of rows, separated by an empty row, contains representative inputs with similar ${\mathcal{D}}$.
  • ...and 1 more figure