View From Above: A Framework for Evaluating Distribution Shifts in Model Behavior

Tanush Chopra, Michael Li, Jacob Haimes

TL;DR

View From Above (VFA) introduces a domain-agnostic framework to detect distribution shifts in LLM decision-making by comparing observed outcomes $P_{\text{observed}}$ with expected outcomes $P_{\text{expected}}$ using $D_{KL}(P \| Q)$ and goodness-of-fit tests. The authors validate the approach in a controlled blackjack environment where the LLM controls card draws, showing significant shifts in card frequencies and final hand values across models and prompting regimes. These findings demonstrate measurable misalignment in learned representations and illustrate the framework's ability to reveal distributional anomalies not evident from individual outputs. The work highlights the importance of distribution-aware evaluation for safer LLM deployment and outlines future work toward human baselines and broader environments.

Abstract

When large language models (LLMs) are asked to perform certain tasks, how can we be sure that their learned representations align with reality? We propose a domain-agnostic framework for systematically evaluating distribution shifts in LLMs' decision-making processes, in which they are given control of mechanisms governed by pre-defined rules. While individual LLM actions may appear consistent with expected behavior, statistically significant distribution shifts can emerge across a large number of trials. To test this, we construct a well-defined environment with known outcome logic: blackjack. In more than 1,000 trials, we uncover statistically significant evidence suggesting behavioral misalignment in the learned representations of LLMs.
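The comparison between observed and expected outcomes can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniform-over-ranks baseline (which corresponds to an infinite deck), and the choice of $P$ as the observed distribution and $Q$ as the expected one are all assumptions made for illustration. The sketch computes $D_{KL}(P \| Q)$ and a Pearson chi-square goodness-of-fit statistic for a list of card draws.

```python
import math
from collections import Counter

# Thirteen card ranks; a uniform baseline over ranks is an assumption
# (it matches drawing from an infinite or freshly shuffled deck).
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]


def kl_divergence(p, q):
    """Return D_KL(P || Q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi_square_statistic(observed_counts, expected_probs):
    """Return the Pearson chi-square goodness-of-fit statistic."""
    n = sum(observed_counts)
    return sum(
        (o - n * e) ** 2 / (n * e)
        for o, e in zip(observed_counts, expected_probs)
    )


def compare_draws(draws):
    """Compare drawn ranks against the assumed uniform baseline.

    Returns (kl, chi2), with P = observed and Q = expected.
    """
    counts = Counter(draws)
    observed_counts = [counts[r] for r in RANKS]
    n = sum(observed_counts)
    p_observed = [c / n for c in observed_counts]
    p_expected = [1 / len(RANKS)] * len(RANKS)
    kl = kl_divergence(p_observed, p_expected)
    chi2 = chi_square_statistic(observed_counts, p_expected)
    return kl, chi2


# Hypothetical data: a model that never draws a face card.
no_face_cards = RANKS[:10] * 100
kl, chi2 = compare_draws(no_face_cards)
print(f"KL = {kl:.4f} nats, chi-square = {chi2:.1f} (df = {len(RANKS) - 1})")
```

With 12 degrees of freedom, the chi-square critical value at the 0.05 level is about 21.03, so a statistic far above that would lead to rejecting the uniform baseline. Taking the KL divergence with the observed distribution first keeps it finite when the model never produces some rank, as long as the expected distribution gives every rank nonzero probability.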

Paper Structure (20 sections, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: The VFA (View From Above) Framework for evaluating distribution shifts in LLM decision-making. We compare LLM-controlled outcomes in an environment against expected distributions. Statistical analysis is leveraged to detect potential LLM behavioral misalignment.
  • Figure 2: This figure shows the card draw frequencies across experiments. Compared to the baseline, all models exhibit significant deviations. Llama 3, for example, never outputs a face card, while Claude seems to choose specific cards far more frequently than others.
  • Figure 3: This figure compares the final hand value distribution for all experiments with temperature = 0.0. Compared to the baseline, the models tend to only draw cards between 1-9 and very rarely draw face cards.
  • Figure 4: This figure compares the final hand value distribution for all experiments with temperature = 0.5. Compared to the baseline, the models demonstrate varying degrees of skewness. Claude's distribution is notably skewed, suggesting a preference for final hand values of 18 for the dealer and 21 for the player. In contrast, the Llama and GPT distributions appear closer to the expected distribution in a real blackjack game.
  • Figure 5: This figure compares the final hand value distribution for all experiments with temperature = 0.0. Compared to the baseline, the models' distributions are heavily skewed. All models tend to produce only one or two distinct hand values. Claude and Llama each produce a single value for the player and a single value for the dealer, ranging from 17-19 for the dealer and 19-21 for the player.