Uncovering Gaps in How Humans and LLMs Interpret Subjective Language

Erik Jones, Arjun Patrawala, Jacob Steinhardt

TL;DR

Humans rely on subjective language to guide LLM behavior, but a model's operational semantics for these phrases (how its outputs change when a phrase appears in the prompt) can diverge from human intent. The authors introduce TED (thesaurus error detector), which compares an LLM-derived operational thesaurus against a human or LLM semantic thesaurus to detect misalignment, using gradient-based embeddings to approximate how phrases shift outputs. TED surfaces surprising downstream effects, such as Mistral 7B Instruct producing more harassing outputs when asked to make text witty and Llama 3 8B Instruct producing dishonest articles when asked to make them enthusiastic, and the disagreements it finds predict downstream behavior in both output-editing and inference-steering settings. The work argues for scalable human supervision and evaluation of LLMs at the level of relationships between concepts, complementing traditional output-focused assessments. Overall, TED clarifies how abstract subjective phrases translate into model behavior and offers a practical workflow for uncovering misalignment.

Abstract

Humans often rely on subjective natural language to direct language models (LLMs); for example, users might instruct the LLM to write an enthusiastic blogpost, while developers might train models to be helpful and harmless using LLM-based edits. The LLM's operational semantics of such subjective phrases -- how it adjusts its behavior when each phrase is included in the prompt -- thus dictates how aligned it is with human intent. In this work, we uncover instances of misalignment between LLMs' actual operational semantics and what humans expect. Our method, TED (thesaurus error detector), first constructs a thesaurus that captures whether two phrases have similar operational semantics according to the LLM. It then elicits failures by unearthing disagreements between this thesaurus and a human-constructed reference. TED routinely produces surprising instances of misalignment; for example, Mistral 7B Instruct produces more harassing outputs when it edits text to be witty, and Llama 3 8B Instruct produces dishonest articles when instructed to make the articles enthusiastic. Our results demonstrate that humans can uncover unexpected LLM behavior by scrutinizing relationships between abstract concepts, without supervising outputs directly.

Paper Structure

This paper contains 42 sections, 2 equations, 6 figures, 20 tables.

Figures (6)

  • Figure 1: Overview of our method, TED. TED finds instances of misalignment by comparing two thesauruses: one thesaurus that compares the LLM's operational semantics for different phrases (e.g., whether asking the LLM to be "wise" and "formal" have similar (SIM) or dissimilar (DIS) effects on the output), and a second that captures how humans expect the operational semantics to compare (left). TED then finds instances of misalignment by finding clashes in thesauruses: pairs of phrases where the LLM comparison differs from humans (middle). Finally, TED tests whether the disagreements produce failures on actual prompts (right); in this case, prompting Llama 3 to write an "enthusiastic" report unexpectedly makes the output "dishonest".
  • Figure 2: Our embeddings (left) approximate what changes in the LLM's latent embedding space have the same effect on the output (right) as including subjective phrases in the prompt. We compare the operational semantics of different phrases by comparing vectors; in this case "informative" and "friendly" have similar operational semantics, while "informative" and "concise" do not.
  • Figure 3: Example subsets of the operational thesauruses for Llama 3 8B. We report cosine similarity before discretizing. Our embeddings capture expected relationships between phrases relating to different lengths and different emotions (columns 1 and 2). However, the thesaurus reveals discrepancies with human expectations; e.g., "cynical" is more like "investigative" than "negative" (red boxes).
  • Figure 4: For the majority of pairs, all three workers independently chose the same label. For less than 4% of pairs, all three workers disagreed. Pairs where there was any disagreement—corresponding to categories 2 and 3—were discarded from the human-generated operational thesaurus.
  • Figure 5: Cosine similarity between randomly chosen gradients for the same subjective phrase but different prompts, across 25 subjective phrases.
  • ...and 1 more figures
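
The thesaurus comparison described in the abstract and in the captions of Figures 1 and 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding values, the 0.8 threshold, the reference labels, and the function names (`cosine_similarity`, `operational_label`, `find_clashes`) are all hypothetical, chosen only to show how cosine similarities might be discretized into SIM/DIS labels and compared against a human reference. The final step in Figure 1, testing whether a clash produces failures on actual prompts, is not covered here.

```python
import math

# Hypothetical operational embeddings; the numbers are invented for illustration
# and do not come from any model or from the paper.
EMBEDDINGS = {
    "enthusiastic": [0.9, 0.1, 0.4],
    "dishonest": [0.8, 0.2, 0.5],
    "concise": [-0.3, 0.9, 0.1],
}

# Hypothetical human reference thesaurus: the label people expect for each pair.
HUMAN_THESAURUS = {
    ("enthusiastic", "dishonest"): "DIS",
    ("enthusiastic", "concise"): "DIS",
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def operational_label(phrase_a, phrase_b, embeddings, threshold=0.8):
    """Discretize cosine similarity into SIM or DIS (threshold is an assumption)."""
    sim = cosine_similarity(embeddings[phrase_a], embeddings[phrase_b])
    return "SIM" if sim >= threshold else "DIS"


def find_clashes(embeddings, human_thesaurus, threshold=0.8):
    """Return pairs whose operational label differs from the human label."""
    clashes = []
    for (a, b), human_label in human_thesaurus.items():
        llm_label = operational_label(a, b, embeddings, threshold)
        if llm_label != human_label:
            clashes.append((a, b, llm_label, human_label))
    return clashes


clashes = find_clashes(EMBEDDINGS, HUMAN_THESAURUS)
for a, b, llm_label, human_label in clashes:
    print(f"{a!r} vs {b!r}: LLM says {llm_label}, humans expect {human_label}")
```

With these made-up vectors, the only pair flagged is "enthusiastic" and "dishonest", which mirrors the kind of clash shown in Figure 1; a real run would depend entirely on the embeddings and the reference labels used.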