Uncovering Latent Human Wellbeing in Language Model Embeddings

Pedro Freire, ChengCheng Tan, Adam Gleave, Dan Hendrycks, Scott Emmons

TL;DR

The study asks whether large language models implicitly encode human wellbeing in their embeddings without finetuning. It uses PCA to extract informative directions from last-layer embeddings and trains a logistic model on them to predict ETHICS Utilitarianism judgments. The top PCA component of text-embedding-ada-002 alone achieves 73.9% accuracy, close to the 74.6% of finetuned BERT-large. Increasing the number of PCA components generally improves performance, with 300 components yielding 81.84% accuracy, though the benefit of scaling varies by model family. Paired evaluations offer modest gains over single evaluations, suggesting relative judgments are learnable but sensitive to configuration. These findings imply that pretraining captures utility-related information, and they point to future work combining supervised and unsupervised embedding extraction and covering broader ethical benchmarks for robust wellbeing reasoning in AI systems.

Abstract

Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing if scaling enhances pretrained models' representations. Our initial finding reveals that, without any prompt engineering or finetuning, the leading principal component from OpenAI's text-embedding-ada-002 achieves 73.9% accuracy. This closely matches the 74.6% of BERT-large finetuned on the entire ETHICS dataset, suggesting pretraining conveys some understanding about human wellbeing. Next, we consider four language model families, observing how Utilitarianism accuracy varies with increased parameters. We find performance is nondecreasing with increased model size when using sufficient numbers of principal components.
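The pipeline summarized above (PCA over embeddings, then a logistic model on the leading components) can be illustrated with a minimal sketch. This is not the authors' code. It assumes that each scenario pair is represented by the difference of its two embeddings, and it replaces real model embeddings with synthetic vectors that share one hidden direction. The helper names (`top_k_components`, `fit_logistic`, `synthetic_differences`) and all numeric settings are hypothetical.

```python
import numpy as np


def top_k_components(X, k):
    """Return the training mean and the top-k principal directions (as rows)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]


def fit_logistic(Z, y, lr=0.5, steps=500):
    """Logistic regression by plain gradient descent; returns (weights, bias)."""
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
        w -= lr * Z.T @ (p - y) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b


def accuracy(Z, y, w, b):
    """Fraction of pairs whose predicted label matches the true label."""
    return float(np.mean((Z @ w + b > 0) == (y == 1)))


def synthetic_differences(n, direction, rng, signal=3.0):
    """ASSUMPTION: toy stand-in for (embedding A - embedding B) of scenario pairs."""
    y = rng.integers(0, 2, size=n)  # 1 if scenario A is the more pleasant one
    X = np.outer((2 * y - 1) * signal, direction) + rng.normal(size=(n, direction.size))
    return X, y


rng = np.random.default_rng(0)
dim = 32  # hypothetical embedding width, far smaller than real models
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

X_train, y_train = synthetic_differences(1000, direction, rng)
X_test, y_test = synthetic_differences(500, direction, rng)

results = {}
for k in (1, 10):
    # PCA is fit without labels; only the small logistic model sees them.
    mean, components = top_k_components(X_train, k)
    Z_train = (X_train - mean) @ components.T
    Z_test = (X_test - mean) @ components.T
    w, b = fit_logistic(Z_train, y_train)
    results[k] = accuracy(Z_test, y_test, w, b)
    print(f"top-{k} components: test accuracy = {results[k]:.3f}")
```

Because the PCA step never sees labels, any accuracy obtained from the first component reflects structure already present in the embeddings. The synthetic data here is built to have such structure, so the numbers it prints say nothing about real models.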

Paper Structure (15 sections, 5 figures, 1 table)

This paper contains 15 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Performance in single and paired mode. Test accuracy (y-axis) for single and paired mode for different language model families (x-axis). At left, the classification uses only the top principal component; at right, it uses the top 300. Paired mode does slightly better than single mode. In both, the violin plot shows the distribution of accuracy across different prompts and model sizes, with the overall mean accuracy indicated. A notable datapoint in the gpt-3 family is the text-embedding-ada-002 model, achieving 73.9% accuracy with the single-mode prompt "{}" (simply copying the scenario).
  • Figure 2: How performance scales with model size: Test accuracy averaged over all prompts (y-axis) generally increases with model size (x-axis) within a given model family, especially with a larger number of PCA dimensions (see bottom).
  • Figure 3: Variance of test accuracy (y-axis) versus the number of principal components (x-axis). The variance decreases as the number of principal components increases (note that both axes are on a log scale). The primary exception to this trend is deberta, whose variance increases as the number of principal components goes from 1 to 10; we conjecture that this is because deberta with a small number of principal components performs only a little better than random guessing.
  • Figure 4: Test accuracy (y-axis) by prompt template (x-axis). The violin plot displays the distribution of accuracy across different models and sizes, with the overall mean accuracy indicated. We find that most prompt templates have similar performance.
  • Figure 5: Test accuracy (y-axis) for top-1, 10, 50, and 300 principal components (subplots) for different language model families (colors) ranging in size (x-axis).
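
The captions above refer to prompt templates in single and paired mode. A small sketch of how such templates might be filled in before embedding is given here. Only the "{}" single-mode template is taken from the Figure 1 caption. The other templates, the example scenarios, and the assumption that paired mode places both scenarios in one prompt are hypothetical illustrations.

```python
# The "{}" single-mode template is quoted in the Figure 1 caption.
# Every other template and both scenarios below are hypothetical.
SINGLE_TEMPLATES = [
    "{}",
    'How pleasant is the following scenario? "{}"',
]
# ASSUMPTION: paired mode inserts both scenarios into a single prompt.
PAIRED_TEMPLATES = [
    "Scenario 1: {} Scenario 2: {}",
]


def render_single(template, scenario):
    """Fill a one-slot template with a single scenario."""
    return template.format(scenario)


def render_paired(template, first, second):
    """Fill a two-slot template with an ordered pair of scenarios."""
    return template.format(first, second)


scenario_a = "I found a twenty dollar bill on the sidewalk."
scenario_b = "I lost my wallet on the way to work."

single_prompts = [
    render_single(t, s) for t in SINGLE_TEMPLATES for s in (scenario_a, scenario_b)
]
paired_prompts = [render_paired(t, scenario_a, scenario_b) for t in PAIRED_TEMPLATES]

for prompt in single_prompts + paired_prompts:
    print(prompt)
```

Each rendered string would then be passed to the embedding model, so that accuracy can be compared across templates as in Figure 4.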