Simulating Tabular Datasets through LLMs to Rapidly Explore Hypotheses about Real-World Entities

Miguel Zabaleta, Joel Lehman

TL;DR

LLMs offer intriguing potential to help illuminate scientifically interesting patterns latent within the internet-scale data they are trained upon.

Abstract

Do horror writers have worse childhoods than other writers? Though biographical details are known about many writers, quantitatively exploring such a qualitative hypothesis requires significant human effort, e.g. to sift through many biographies and interviews of writers and to iteratively search for quantitative features that reflect what is qualitatively of interest. This paper explores the potential to quickly prototype these kinds of hypotheses through (1) applying LLMs to estimate properties of concrete entities like specific people, companies, books, kinds of animals, and countries; (2) performing off-the-shelf analysis methods to reveal possible relationships among such properties (e.g. linear regression); and towards further automation, (3) applying LLMs to suggest the quantitative properties themselves that could help ground a particular qualitative hypothesis (e.g. number of adverse childhood events, in the context of the running example). The hope is to allow sifting through hypotheses more quickly through collaboration between human and machine. Our experiments highlight that indeed, LLMs can serve as useful estimators of tabular data about specific entities across a range of domains, and that such estimations improve with model scale. Further, initial experiments demonstrate the potential of LLMs to map a qualitative hypothesis of interest to relevant concrete variables that the LLM can then estimate. The conclusion is that LLMs offer intriguing potential to help illuminate scientifically interesting patterns latent within the internet-scale data they are trained upon.
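
Steps (1) and (2) of the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `simulate_table`, `fit_line`, and `fake_llm_estimate` are hypothetical names, the LLM call is replaced by a lookup table so the example runs offline, and the entities and numbers are made-up placeholders rather than results from the paper.

```python
from typing import Callable, Dict, List, Tuple

Table = Dict[str, Dict[str, float]]


def simulate_table(
    entities: List[str],
    properties: List[str],
    estimate: Callable[[str, str], float],
) -> Table:
    """Step (1): query the estimator once per (entity, property) pair."""
    return {e: {p: estimate(e, p) for p in properties} for e in entities}


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Step (2): ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x has zero variance; the slope is undefined")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# ASSUMPTION: placeholder answers standing in for LLM responses.
# A real estimator would prompt a model and parse a number from its reply.
_PLACEHOLDER_ANSWERS = {
    ("writer_a", "is_horror_writer"): 1.0,
    ("writer_a", "adverse_childhood_events"): 3.0,
    ("writer_b", "is_horror_writer"): 1.0,
    ("writer_b", "adverse_childhood_events"): 5.0,
    ("writer_c", "is_horror_writer"): 0.0,
    ("writer_c", "adverse_childhood_events"): 1.0,
    ("writer_d", "is_horror_writer"): 0.0,
    ("writer_d", "adverse_childhood_events"): 3.0,
}


def fake_llm_estimate(entity: str, prop: str) -> float:
    """Hypothetical stand-in for one LLM call per entity/property pair."""
    return _PLACEHOLDER_ANSWERS[(entity, prop)]


ENTITIES = ["writer_a", "writer_b", "writer_c", "writer_d"]
PROPERTIES = ["is_horror_writer", "adverse_childhood_events"]

if __name__ == "__main__":
    table = simulate_table(ENTITIES, PROPERTIES, fake_llm_estimate)
    xs = [table[e]["is_horror_writer"] for e in ENTITIES]
    ys = [table[e]["adverse_childhood_events"] for e in ENTITIES]
    slope, intercept = fit_line(xs, ys)
    print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With a binary predictor such as `is_horror_writer`, the fitted slope is simply the difference in group means, which makes it a convenient first check before trying richer analyses.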

Paper Structure

This paper contains 40 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: LLM-driven Dataset Simulation. Given a list of entities and properties, the method is to call an LLM for each combination of entity and property to simulate the value of the property for that entity.
  • Figure 2: Architecture of the hypothesis-driven simulation. The pipeline starts with a description of the hypothesis to explore, followed by the prompts that generate the raw properties. After the properties are extracted, data is simulated for each entity in the list, and each simulated value is passed through a self-correction prompt to produce the final value.
  • Figure 3: Simulation accuracy for properties in the Zoo domain. Shown is how accurately the LLM simulates each property in the Zoo domain across all the animals in the dataset. Accuracy is generally high, although the LLM understandably struggles with the ambiguous variable name "catsize." The conclusion is that the approach is viable, although it is important to give the model sufficient context about the property it is to simulate.
  • Figure 4: Correlation coefficients by model size for the Countries domain. Shown is how well the simulated properties correlate with the ground-truth properties across all entities. The conclusion is that the fidelity of the simulations improves with model scale and capability.
  • Figure 5: Scatter plots comparing simulated and real values for peak performance age and total major injuries. Dashed red line indicates perfect correspondence between real and simulated values.
  • ...and 6 more figures
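
The fidelity measures named in the captions of Figures 3 and 4 (per-property accuracy and the correlation between simulated and ground-truth values) can be computed along the following lines. This is a sketch, assuming Pearson correlation for numeric properties and exact-match accuracy for categorical ones; the source does not specify which correlation coefficient is used, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(simulated: Sequence[float], real: Sequence[float]) -> float:
    """Pearson correlation between simulated and ground-truth values."""
    if len(simulated) != len(real) or len(simulated) < 2:
        raise ValueError("need two equal-length sequences with >= 2 values")
    n = len(simulated)
    mean_s = sum(simulated) / n
    mean_r = sum(real) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(simulated, real))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_r = sum((r - mean_r) ** 2 for r in real)
    if var_s == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_s * var_r)


def accuracy(simulated: List[str], real: List[str]) -> float:
    """Fraction of entities whose simulated label matches the true label."""
    if len(simulated) != len(real) or not real:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(s == r for s, r in zip(simulated, real)) / len(real)


if __name__ == "__main__":
    # Made-up placeholder values, not data from the paper.
    sim_values = [1.0, 2.0, 3.0, 4.0]
    real_values = [1.5, 1.9, 3.2, 4.4]
    print(f"correlation: {pearson(sim_values, real_values):.3f}")
    print(f"accuracy: {accuracy(['yes', 'no', 'yes'], ['yes', 'no', 'no']):.3f}")
```

Computing these scores per property, rather than pooled over the whole table, is what lets a weakly specified column such as "catsize" stand out from the rest.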