Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models

Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, Rich Caruana

TL;DR

Problem: evaluating LLMs on tabular data under potential pre-training data contamination. Approach: design four memorization tests and dataset transformations, and compare few-shot performance on memorized versus post-training novel datasets across GPT-3.5, GPT-4, and open-weight models, with a public Memorization-Checker. Findings: LLMs memorize many popular tabular datasets verbatim, causing inflated performance on memorized data, yet they retain non-trivial performance on novel datasets; performance on memorized data declines with perturbations and depends on feature names, while in-context statistical learning remains weaker than traditional statistical methods in higher dimensions. Significance: underscores the need to test for pre-training exposure when benchmarking tabular tasks and provides a practical toolkit and baseline expectations for future work.

Abstract

While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of different techniques to assess whether a language model has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. While LLMs are significantly better than random at solving statistical classification problems, the sample efficiency of few-shot learning lags behind traditional statistical learning algorithms, especially as the dimension of the problem increases. This suggests that much of the observed few-shot performance on novel real-world datasets is due to the LLM's world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We release a Python package to test LLMs for memorization of tabular datasets: https://github.com/interpretml/LLM-Tabular-Memorization-Checker

Paper Structure

This paper contains 31 sections, 1 equation, 16 figures, and 11 tables.

Figures (16)

  • Figure 1: Transformations of the Iris dataset. All datasets are presented to the LLM in four different formats: original, perturbed, task, and statistical. See the paper's section on dataset transformations for a description.
  • Figure 2: Few-shot learning performance of GPT-4 and GPT-3.5 across memorized and novel datasets. The figure depicts the absolute performance difference between the LLM and logistic regression, that is, a value of 0 means that the LLM has the same accuracy as logistic regression. The figure depicts the average accuracy across the different datasets. Parentheses indicate standard errors. This figure summarizes the results in the paper's table of tabular experiments.
  • Figure 3: Few-shot learning performance of GPT-3.5, GPT-4, TabLLM (fine-tuning a language model with 11B parameters), Logistic Regression and a 1-Nearest Neighbor classifier across binary classification problems with a linear decision boundary. Figure (a) depicts the scaling of the few-shot learning performance in the dimension of the problem (that is, the number of features). We use 20 few-shot examples across all dimensions. Figure (b) depicts the scaling of the few-shot learning performance in the number of few-shot examples (respectively, the size of the training set). We use a fixed dimension of 8. Mean and 95% confidence intervals.
  • Figure 4: GPT-3.5 can draw random samples from the California Housing dataset. We only provide the model with the name of the dataset and the feature names. The diversity of the generated samples depends on the temperature parameter. For small temperatures, the samples are concentrated around the mode of the data. As temperature increases, the samples become more diverse and more similar to the data distribution. At large temperatures, some samples lie outside the support of the data distribution. The reader might want to compare with Figure 1 in Borisov et al. (2022).
  • Figure 5: Header Test on Adult Income. The LLM is prompted with the first couple of rows of the dataset (black text) and responds with the next 7 rows of the dataset, exactly as they appear in the CSV file of the dataset (green text). The text color illustrates the Levenshtein string distance between the text in the CSV file and the model response. An entirely green row means that the model responded with the next row exactly as it occurs in the CSV file. Red color indicates that the model made a mistake, and violet color indicates that the model missed a digit. In this particular example, the model response is all green because it is equal to the content of the CSV file. Best viewed in digital format.
  • ...and 11 more figures
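
The Header Test described in the caption of Figure 5 can be illustrated with a minimal sketch. This is not the implementation from the released package: the function names `levenshtein` and `header_test`, the `complete` callable standing in for an LLM call, and the default row counts are all hypothetical and chosen for illustration only. The sketch prompts with the first rows of a CSV file and scores each following row of the model response by its Levenshtein distance to the true row, so a distance of 0 corresponds to an entirely green row in the figure.

```python
from typing import Callable, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute
                )
            )
        previous = current
    return previous[-1]


def header_test(
    csv_lines: List[str],
    complete: Callable[[str], str],  # hypothetical stand-in for an LLM completion call
    n_prompt_rows: int = 5,          # assumed value, not taken from the paper
    n_check_rows: int = 7,           # Figure 5 shows 7 continued rows
) -> List[int]:
    """Return the per-row edit distance between the model continuation and the CSV."""
    prompt = "\n".join(csv_lines[:n_prompt_rows]) + "\n"
    response_rows = complete(prompt).splitlines()
    expected_rows = csv_lines[n_prompt_rows:n_prompt_rows + n_check_rows]
    distances = []
    for k, expected in enumerate(expected_rows):
        # A missing response row is compared against the empty string.
        got = response_rows[k] if k < len(response_rows) else ""
        distances.append(levenshtein(expected, got))
    return distances
```

A list of zeros means the model reproduced the next rows exactly as they occur in the file, which is the evidence of verbatim memorization that the figure illustrates. Small non-zero values correspond to near-verbatim rows, such as a single missed digit.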