View From Above: A Framework for Evaluating Distribution Shifts in Model Behavior
Tanush Chopra, Michael Li, Jacob Haimes
TL;DR
View From Above (VFA) introduces a domain-agnostic framework to detect distribution shifts in LLM decision-making by comparing observed outcomes $P_{\text{observed}}$ with expected outcomes $P_{\text{expected}}$ using $D_{KL}(P\|Q)$ and goodness-of-fit tests. The authors validate the approach in a controlled blackjack environment where the LLM controls card draws, showing significant shifts in card frequencies and final hand values across models and prompting regimes. These findings demonstrate measurable misalignment in learned representations and illustrate the framework's ability to reveal distributional anomalies not evident from individual outputs. The work highlights the importance of distribution-aware evaluation for safer LLM deployment and outlines future work toward human baselines and broader environments.
Abstract
When large language models (LLMs) are asked to perform certain tasks, how can we be sure that their learned representations align with reality? We propose a domain-agnostic framework for systematically evaluating distribution shifts in LLMs' decision-making processes when they are given control of mechanisms governed by pre-defined rules. Individual LLM actions may appear consistent with expected behavior, yet statistically significant distribution shifts can emerge across a large number of trials. To test this, we construct a well-defined environment with known outcome logic: blackjack. In more than 1,000 trials, we uncover statistically significant evidence suggesting behavioral misalignment in the learned representations of LLMs.
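The comparison of observed and expected outcome distributions can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that card ranks are the outcome of interest and that the expected distribution is uniform over the 13 ranks (as for a fair deck drawn with replacement). The function names, the `RANKS` list, and the sample draws are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Assumption: 13 card ranks, each expected with probability 1/13.
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
EXPECTED = {rank: 1 / len(RANKS) for rank in RANKS}


def empirical_distribution(draws, support):
    """Relative frequency of each outcome in `support` among `draws`."""
    counts = Counter(draws)
    n = len(draws)
    return {outcome: counts.get(outcome, 0) / n for outcome in support}


def kl_divergence(p, q):
    """D_KL(P || Q) in nats. Terms with P(x) = 0 contribute zero."""
    total = 0.0
    for outcome, p_x in p.items():
        if p_x == 0:
            continue
        q_x = q[outcome]
        if q_x == 0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x)
    return total


def chi_square_statistic(draws, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E over outcomes."""
    counts = Counter(draws)
    n = len(draws)
    return sum(
        (counts.get(outcome, 0) - n * prob) ** 2 / (n * prob)
        for outcome, prob in expected.items()
    )


if __name__ == "__main__":
    # Hypothetical draws that over-represent ten-valued cards.
    draws = ["10"] * 30 + ["K"] * 20 + RANKS * 4
    observed = empirical_distribution(draws, RANKS)
    print("KL divergence (nats):", kl_divergence(observed, EXPECTED))
    print("Chi-square statistic:", chi_square_statistic(draws, EXPECTED))
```

The chi-square statistic would then be compared against a chi-square distribution with 12 degrees of freedom (13 ranks minus one) to obtain a p-value, for example with a statistics library. The KL divergence gives a complementary measure of how far the observed distribution departs from the expected one.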
