From Distributional to Overton Pluralism: Investigating Large Language Model Alignment

Thom Lake, Eunsol Choi, Greg Durrett

TL;DR

The paper scrutinizes how alignment shifts LLM output distributions beyond mere usefulness by examining diversity and information content. Using open-ended QA datasets ConflictingQA and LIMA-OE, it shows alignment increases output quality and length while reducing lexical diversity, implying a move from distributional to Overton pluralism. The work demonstrates that aligned behavior can be elicited from base models via careful in-context prompting (in-context alignment), supporting the Superficial Alignment Hypothesis and enabling rapid personalization without fine-tuning. It also introduces in-context distillation strategies to mimic aligned responses and discusses the implications for rapid prototyping, while acknowledging limitations related to dataset scope and model scale.

Abstract

The alignment process changes several properties of a large language model's (LLM's) output distribution. We analyze two aspects of post-alignment distributional shift of LLM responses. First, we re-examine previously reported reductions in response diversity post-alignment. Our analysis suggests that an apparent drop in the diversity of responses is largely explained by quality control and information aggregation. Alignment suppresses irrelevant and unhelpful content while shifting the output distribution toward longer responses that cover information spanning several responses from the base LLM, essentially presenting diverse information in a single response. Since we find little evidence that alignment suppresses useful information, it is natural to ask the opposite question: do aligned models surface information that cannot be recovered from base models? Our second investigation shows this is not the case: the behavior of aligned models is recoverable from base models without fine-tuning. A combination of in-context examples and lower-resolution semantic hints about response content can elicit responses from base LLMs that are as similar to alignment-tuned LLM responses as alignment-tuned LLM responses are to each other. Taken together, these results indicate that current alignment techniques capture but do not extend the useful subset of assistant-like base LLM behavior, providing further evidence for the Superficial Alignment Hypothesis. They also show that in-context alignment can go surprisingly far as a strategy for imitating aligned LLMs without fine-tuning. Our code and data are available at https://github.com/thomlake/investigating-alignment.
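
To make the in-context alignment recipe from the abstract concrete, below is a minimal sketch of how such a prompt might be assembled from a few assistant-style demonstrations plus an optional low-resolution content hint. The demonstrations, the hint format, and the build_prompt helper are illustrative assumptions, not the paper's exact prompt construction (see the linked repository for that).

```python
# Minimal sketch of "in-context alignment": steering a *base* (non-fine-tuned)
# LLM toward assistant-like responses using only the prompt. The demonstrations
# and hint format below are illustrative assumptions; see the paper's
# repository for the actual prompt construction.

# A handful of in-context demonstrations of assistant-style behavior.
DEMONSTRATIONS = [
    ("What causes seasons on Earth?",
     "Seasons are caused by the tilt of Earth's axis relative to its orbital "
     "plane, which changes how directly sunlight hits each hemisphere."),
    ("How can I improve my sleep?",
     "Common evidence-backed steps include keeping a consistent schedule, "
     "limiting late-day caffeine, and reducing screen time before bed."),
]

def build_prompt(question: str, hint: str | None = None) -> str:
    """Assemble a few-shot prompt for a base LLM. `hint` is an optional
    low-resolution description of the desired response content, mirroring
    the "semantic hints" mentioned in the abstract."""
    parts = [f"Question: {q}\nAnswer: {a}\n" for q, a in DEMONSTRATIONS]
    if hint is not None:
        parts.append(f"(The answer should touch on: {hint})\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

prompt = build_prompt(
    "Is nuclear energy safe?",
    hint="safety record, waste disposal, comparison to fossil fuels",
)
print(prompt)  # feed to a base LLM's completion endpoint
```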

Paper Structure

This paper contains 41 sections, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: Comparing outputs from an unaligned (left) and aligned (right) language model pair. A single response from the aligned model contains useful information that the unaligned model only surfaces through repeated sampling, while omitting unhelpful content.
  • Figure 2: The relationship between lexical coverage, semantic coverage, response length, and helpfulness (x-axis) on LIMA-OE. Cover-LEX and Cover-SEM are computed with respect to Llama 2 Chat, with smaller values corresponding to more content that the base model surfaces but is missing from the aligned model's response. When there is less overlap between the base and aligned model, base model responses are of lower quality. Helpful responses from the base model tend to cover the same content as the reference under both coverage metrics (see the coverage sketch after this list).
  • Figure 3: (a) Response stance distribution on the ConflictingQA dataset for Llama 2 models. Aligned models provide more comprehensive responses (Overton pluralistic, "both") than the base model, whose responses mostly contain a single perspective (yes or no). (b) Distribution of response stance entropy. Aligned models are also more consistent across sampled responses (lower entropy; see the entropy sketch after this list).
  • Figure 4: Histograms of maximum lexical similarity between RLHF-tuned model outputs and base model outputs under various ICL alignment techniques (left: Llama 2; right: Mistral). The top row depicts Self-Sim (max) for teacher model responses, which are fairly self-similar. Including additional context, in the form of teacher responses and in-domain questions, increases similarity to the teacher model, eventually reaching substantial distributional overlap (see the similarity sketch after this list).
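
Figure 2's coverage metrics can be illustrated with a small sketch. The definition below, unigram overlap between a base-model response and the aligned (reference) response, is an assumption made for illustration; the paper's Cover-LEX and Cover-SEM may be computed differently (Cover-SEM presumably uses embeddings rather than word overlap).

```python
# Hypothetical sketch of a lexical coverage score in the spirit of Cover-LEX:
# the fraction of words in a base-model response that also appear in the
# aligned (reference) response. Tokenization and the exact definition are
# assumptions, not the paper's implementation.
import re

def words(text: str) -> set[str]:
    """Lowercased word tokens; a real implementation might also remove
    stopwords or lemmatize."""
    return set(re.findall(r"[a-z']+", text.lower()))

def cover_lex(base_response: str, reference_response: str) -> float:
    """Fraction of the base response's words covered by the reference.
    Smaller values mean more base-model content is missing from the
    aligned model's response."""
    base, ref = words(base_response), words(reference_response)
    return len(base & ref) / len(base) if base else 0.0

# 5 of 6 base words appear in the reference -> ~0.83
print(cover_lex("solar power is clean and renewable",
                "solar power is a clean, renewable energy source"))
```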
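Figure 3(b)'s stance entropy admits a compact sketch: sample several responses per question, label each with a stance, and compute the Shannon entropy of the label distribution. The stance labels here are hand-supplied stand-ins; in practice labeling would be automated (e.g., by a classifier or an LLM judge).

```python
# Hypothetical sketch of the response-stance entropy in Figure 3(b):
# sample several responses per question, label each response's stance
# ("yes", "no", or "both"), and compute the entropy of the empirical
# label distribution. Low entropy means the model answers consistently.
from collections import Counter
from math import log2

def stance_entropy(stances: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical stance distribution."""
    counts = Counter(stances)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(stance_entropy(["yes"] * 10))              # 0.0: perfectly consistent
print(stance_entropy(["yes"] * 5 + ["no"] * 5))  # 1.0: maximally split
print(stance_entropy(["both"] * 8 + ["no"] * 2)) # ~0.72
```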
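Finally, a sketch of the maximum-lexical-similarity statistic histogrammed in Figure 4. Jaccard similarity over word sets is a stand-in assumption here; the paper's lexical similarity measure may differ.

```python
# Hypothetical sketch of the maximum lexical similarity statistic in
# Figure 4: for each response from model A, record the similarity of its
# closest match among model B's responses, then histogram those maxima.
# Jaccard similarity over word sets is used as a stand-in metric. For
# Self-Sim (max), A and B are the same response set and the identical
# pair would be skipped.
import re

def word_set(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a: str, b: str) -> float:
    wa, wb = word_set(a), word_set(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def max_similarities(responses_a: list[str], responses_b: list[str]) -> list[float]:
    """For each response in A, its best lexical match among B."""
    return [max(jaccard(a, b) for b in responses_b) for a in responses_a]

base_samples = ["the sky is blue", "grass tends to be green"]
teacher_samples = ["the sky appears blue due to scattering", "grass is green"]
print(max_similarities(base_samples, teacher_samples))  # [0.375, 0.333...]
```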