Elephants Never Forget: Testing Language Models for Memorization of Tabular Data

Sebastian Bordt, Harsha Nori, Rich Caruana

TL;DR

The investigation reveals that LLMs are pre-trained on many popular tabular datasets, and identifies a regime where the language model reproduces important statistics of the data, but fails to reproduce the dataset verbatim.

Abstract

While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Starting with simple qualitative tests for whether an LLM knows the names and values of features, we introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization. Our investigation reveals that LLMs are pre-trained on many popular tabular datasets. This exposure can lead to invalid performance evaluation on downstream tasks because the LLMs have, in effect, been fit to the test set. Interestingly, we also identify a regime where the language model reproduces important statistics of the data, but fails to reproduce the dataset verbatim. On these datasets, although seen during training, good performance on downstream tasks might not be due to overfitting. Our findings underscore the need for ensuring data integrity in machine learning tasks with LLMs. To facilitate future research, we release an open-source tool that can perform various tests for memorization (https://github.com/interpretml/LLM-Tabular-Memorization-Checker).

Paper Structure

This paper contains 22 sections, 12 figures, and 5 tables.

Figures (12)

  • Figure 1: Conditional completion on Adult Income and FICO. We give the model all previous feature values in the dataset as a prefix and ask it to complete the next value in the data. The figure depicts the fraction of correctly completed observations (i.e., an observation exists in the data with that prefix and completion) for three different methods: (1) completion with GPT-3.5, (2) completion with GPT-4, and (3) as a baseline, completion with a random draw from the feature's marginal distribution (i.e., with a random sample from the values of the next column). If GPT-3.5 (green) or GPT-4 (blue) is able to complete rows with true values from the original dataset at a rate higher than the baseline (yellow), this is evidence that the LLM has seen the data before and memorized parts of it. We show this for 8 different features on 2 datasets. The mean and 95% confidence intervals are shown.
  • Figure 2: Zero-knowledge prompting allows us to draw samples from the LLM's parametric knowledge. Here we ask GPT-3.5 to sample from the California Housing dataset. Zero-knowledge prompting conditions the model on the task by providing few-shot examples from other datasets, without revealing any information about the dataset being tested. Top: The diversity of the generated samples depends on the temperature parameter. For small temperatures, the model produces similar observations concentrated around the mode of the data. As the temperature increases, the samples become more diverse and more similar to the distribution of the data. At large temperatures, some samples lie outside the support of the data distribution. Bottom: Illustration of the prompt strategy.
  • Figure 3: Comparison between the Pearson correlation coefficients in the original dataset and the zero-knowledge samples drawn from GPT-3.5 and GPT-4. For each dataset, we draw 1000 samples with temperature 0.7. Figures for all possible combinations of dataset and language model can be found in the Supplement. The four datasets depicted in the figure were chosen due to their diverse results in the table of test results.
  • Figure 4: Pearson correlation coefficients for all possible combinations of dataset and language model. Compare Figure 3 in the main paper. Temperature 0.7. Continued on next page.
  • Figure 5: Pearson correlation coefficients for all possible combinations of dataset and language model. Compare Figure 3 in the main paper. Temperature 0.7.
  • ...and 7 more figures
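
The scoring rule in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not taken from the authors' released tool: the function names `completion_hit_rate` and `marginal_baseline`, the row representation (tuples of strings), and the `complete` callable are all hypothetical. In practice `complete` would wrap an LLM query, which is not shown here. A completion counts as correct when some row in the dataset has the same prefix followed by the proposed value, and the baseline draws the next value from the column's marginal distribution.

```python
import random
from typing import Callable, Sequence

Row = Sequence[str]


def completion_hit_rate(
    rows: Sequence[Row],
    col: int,
    complete: Callable[[Row], str],
) -> float:
    """Fraction of rows whose (prefix, completion) pair occurs in the data.

    `complete` (hypothetical interface) receives the values of columns
    0..col-1 and returns a guess for column `col`. A guess counts as
    correct if some row has that exact prefix followed by that value.
    """
    # Every (prefix, next value) pair that actually occurs in the dataset.
    seen = {(tuple(r[:col]), r[col]) for r in rows}
    hits = sum((tuple(r[:col]), complete(r[:col])) in seen for r in rows)
    return hits / len(rows)


def marginal_baseline(
    rows: Sequence[Row], col: int, seed: int = 0
) -> Callable[[Row], str]:
    """Baseline completer: ignore the prefix, sample from the column's values."""
    rng = random.Random(seed)
    values = [r[col] for r in rows]
    return lambda _prefix: rng.choice(values)
```

Comparing `completion_hit_rate(rows, col, llm_complete)` against `completion_hit_rate(rows, col, marginal_baseline(rows, col))` mirrors the comparison in the caption: a model rate clearly above the baseline rate suggests the model has seen the data. Here `llm_complete` stands for whatever function queries the model and is left to the reader.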