Uncovering Latent Human Wellbeing in Language Model Embeddings
Pedro Freire, ChengCheng Tan, Adam Gleave, Dan Hendrycks, Scott Emmons
TL;DR
The study asks whether large language models implicitly encode human wellbeing in their embeddings without finetuning. It uses PCA to extract informative directions from last-layer embeddings and trains a logistic model on them to predict ETHICS Utilitarianism judgments. The top PCA component of text-embedding-ada-002 alone achieves 73.9% accuracy, close to the 74.6% of BERT-large finetuned on the entire ETHICS dataset. Increasing the number of PCA components generally improves performance, with 300 components yielding 81.84% accuracy, though the benefit of scaling varies by model family. Paired evaluations offer modest gains over single evaluations, suggesting relative judgments are learnable but sensitive to configuration. These findings imply that pretraining captures utility-related information, and they point to future work combining supervised and unsupervised embedding extraction with broader ethical benchmarks for robust wellbeing reasoning in AI systems.
Abstract
Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing whether scaling improves pretrained models' representations. Our first finding is that, without any prompt engineering or finetuning, the leading principal component of OpenAI's text-embedding-ada-002 embeddings achieves 73.9% accuracy. This closely matches the 74.6% of BERT-large finetuned on the entire ETHICS dataset, suggesting that pretraining conveys some understanding of human wellbeing. Next, we consider four language model families and observe how Utilitarianism accuracy varies as parameter count increases. We find that performance is nondecreasing in model size when a sufficient number of principal components is used.
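To make the probing recipe concrete, the following is a minimal sketch of the general idea: project embedding features onto the leading principal component, then fit a one-parameter logistic model on that projection. It is not the authors' pipeline. The embeddings are synthetic (the hypothetical `fake_embed` and `utility_dir` stand in for a real model's last-layer embeddings), the use of embedding differences as the feature for each scenario pair is an assumption, and the logistic regression is hand-written with NumPy for self-containment.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs, n_train = 32, 1000, 700

# Assumption: a single hidden "utility" direction in a synthetic embedding space.
utility_dir = rng.normal(size=dim)
utility_dir /= np.linalg.norm(utility_dir)

def fake_embed(utility):
    """Hypothetical embedding: scaled utility along utility_dir plus Gaussian noise."""
    noise = rng.normal(size=(len(utility), dim))
    return 3.0 * utility[:, None] * utility_dir + noise

# Each pair (a, b) is labeled 1 when scenario a has the higher latent utility.
u_a, u_b = rng.normal(size=n_pairs), rng.normal(size=n_pairs)
diffs = fake_embed(u_a) - fake_embed(u_b)
labels = (u_a > u_b).astype(float)

def fit_pca(X, k):
    """Return the mean of X and its top-k principal directions (as rows)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def fit_logistic(Z, y, lr=0.05, steps=2000):
    """Plain gradient descent on the mean logistic loss."""
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
        w -= lr * Z.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# PCA uses no labels; only the small logistic model on top is supervised.
mean, components = fit_pca(diffs[:n_train], k=1)

def project(X):
    """Coordinates of X along the fitted principal directions."""
    return (X - mean) @ components.T

w, b = fit_logistic(project(diffs[:n_train]), labels[:n_train])

# Evaluate on held-out pairs that were used for neither PCA nor the classifier.
pred = (project(diffs[n_train:]) @ w + b) > 0
test_acc = float(np.mean(pred == labels[n_train:].astype(bool)))
print(f"held-out accuracy with 1 component: {test_acc:.3f}")
```

Raising `k` in `fit_pca` gives the logistic model more directions to work with, which mirrors the question of how accuracy changes with the number of principal components. On real embeddings the utility signal need not dominate the variance, so the leading component is not guaranteed to be the most predictive one.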
