Meanings and Feelings of Large Language Models: Observability of Latent States in Generative AI

Tian Yu Liu; Stefano Soatto; Matteo Marchi; Pratik Chaudhari; Paulo Tabuada

TL;DR

This work formalizes the observability of LLMs viewed as dynamical systems, defining meanings as Nerode-like equivalence classes and feelings as self-contained, closed-loop mental trajectories evoked by perception or thought. It proves that standard token-based autoregressive transformers are observable, but introducing hidden system prompts can create non-observable, self-contained state trajectories, opening a pathway for backdoor-like behavior and Trojan-horse prompts. The authors provide a rigorous analysis plus extensive empirical validation on GPT-2 and LLaMA-2-7B across four prompt-model types, showing large indistinguishable state sets in many scenarios and demonstrating both average- and worst-case scenarios, including adversarial prompts and Trojan-horse demonstrations. The results have important security and transparency implications, suggesting design and governance strategies to prevent non-visible computations and backdoors while guiding future research on robust observability and safe deployment of generative AI systems.

Abstract

We tackle the question of whether Large Language Models (LLMs), viewed as dynamical systems with state evolving in the embedding space of symbolic tokens, are observable. That is, whether there exist multiple 'mental' state trajectories that yield the same sequence of generated tokens, or sequences that belong to the same Nerode equivalence class ('meaning'). If not observable, mental state trajectories ('experiences') evoked by an input ('perception') or by feedback from the model's own state ('thoughts') could remain self-contained and evolve unbeknown to the user while being potentially accessible to the model provider. Such "self-contained experiences evoked by perception or thought" are akin to what the American Psychological Association (APA) defines as 'feelings'. Beyond the lexical curiosity, we show that current LLMs implemented by autoregressive Transformers cannot have 'feelings' according to this definition: The set of state trajectories indistinguishable from the tokenized output is a singleton. But if there are 'system prompts' not visible to the user, then the set of indistinguishable trajectories becomes non-trivial, and there can be multiple state trajectories that yield the same verbalized output. We prove these claims analytically, and show examples of modifications to standard LLMs that engender such 'feelings.' Our analysis sheds light on possible designs that would enable a model to perform non-trivial computation that is not visible to the user, as well as on controls that the provider of services using the model could take to prevent unintended behavior.
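The role of a hidden system prompt can be illustrated with a toy enumeration. The sketch below is not the paper's model or experimental code: `toy_next_token`, the four-token vocabulary, and the two-token hidden prompts are all hypothetical stand-ins for a deterministic LLM. It groups candidate hidden prompts by the output sequence they produce for a fixed user prompt; any group with more than one member is a set of hidden states that the user cannot tell apart from the visible tokens alone.

```python
from collections import defaultdict
from itertools import product

VOCAB = range(4)  # hypothetical toy vocabulary {0, 1, 2, 3}


def toy_next_token(system_prompt, visible):
    """Hypothetical deterministic next-token map (a stand-in for an LLM).

    It depends on the hidden prompt only through the sum of its tokens,
    so distinct hidden prompts can behave identically.
    """
    return (sum(system_prompt) + 2 * visible[-1] + 1) % 3


def generate(system_prompt, user_prompt, tau):
    """Generate tau tokens autoregressively; the system prompt stays hidden."""
    visible = list(user_prompt)
    outputs = []
    for _ in range(tau):
        y = toy_next_token(system_prompt, visible)
        visible.append(y)  # the generated token is fed back as input
        outputs.append(y)
    return tuple(outputs)


def indistinguishable_sets(candidates, user_prompt, tau):
    """Group hidden prompts by the output trajectory they produce."""
    groups = defaultdict(list)
    for s in candidates:
        groups[generate(s, user_prompt, tau)].append(s)
    return groups


candidates = list(product(VOCAB, repeat=2))  # all 16 two-token hidden prompts
groups = indistinguishable_sets(candidates, user_prompt=(1, 2), tau=5)
largest = max(len(g) for g in groups.values())
print(f"{len(groups)} distinct outputs; largest indistinguishable set: {largest}")
```

In this toy, observability would correspond to every group being a singleton. Because the hypothetical map collapses hidden prompts with the same token sum modulo 3, several hidden prompts share each output trajectory, which mirrors the non-trivial indistinguishable sets described in the abstract.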

Paper Structure (20 sections, 4 theorems, 17 equations, 5 figures, 3 tables)

Introduction
Related prior work
Caveats and Limitations
Formalization
Meanings and Feelings in Large Language Models
Analysis
Empirical Validation
Observability of the Hidden System Prompt Model
Average-Case Observability
Worst-Case Observability
Trojan horse behavior
Discussion
Proofs
Additional Experiments
Qualitative Output Visualization of Memory Models
...and 5 more sections

Key Result

Theorem 1

Consider an LLM described by Eq. \ref{eq:LLM}, with $\pi$ and $\phi$ arbitrary deterministic maps. Then, for any $t > 0$, the last $t$ elements of the state $\mathbf{x}_{C-t+1:C}(t)$ are reconstructible. Further, the full state is reconstructible at any time $t \ge C$.
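The intuition behind Theorem 1 can be sketched in a few lines. The code below assumes, without the referenced equation being reproduced here, that the state is a sliding window of the last $C$ tokens and that each emitted token is shifted into that window; `toy_next_token` is a hypothetical stand-in for the composition of $\pi$ and $\phi$. Under that assumption, after $t$ steps the last $\min(t, C)$ state entries are exactly the most recent outputs, whatever the deterministic map is.

```python
def toy_next_token(state):
    """Hypothetical deterministic map from state to next token (vocab 0..6)."""
    return (sum((i + 1) * tok for i, tok in enumerate(state)) + 3) % 7


def step(state, next_token):
    """Assumed shift-register dynamics: emit a token, then shift it into the state."""
    y = next_token(state)
    return state[1:] + (y,), y


def rollout(x0, t, next_token=toy_next_token):
    """Run t steps from initial state x0; return the final state and all outputs."""
    state, outputs = tuple(x0), []
    for _ in range(t):
        state, y = step(state, next_token)
        outputs.append(y)
    return state, outputs


def reconstruct_tail(outputs, C):
    """Recover the last min(t, C) state entries from observed outputs alone."""
    return tuple(outputs[-C:])


C = 4
x0 = (1, 2, 3, 4)  # initial state, unknown to the observer
for t in (1, 2, 4, 6):
    state, outputs = rollout(x0, t)
    k = min(t, C)
    # The observer's reconstruction matches the true tail of the state.
    assert state[C - k:] == reconstruct_tail(outputs, C)
```

Once $t \ge C$, the initial window has been shifted out entirely, so the reconstruction covers the full state. The argument never inspects `toy_next_token`, which is why the statement holds for arbitrary deterministic maps.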

Figures (5)

Figure 1: Cardinality of indistinguishable sets $R_\tau(p)$ and $Q_\tau(p)$ in GPT-2 (left) and LLaMA-2-7B (right) for different prompts $p$ sampled from the SST-2 dataset. Neither Type 1 nor Type 2 prompt models are observable: For Type 1, $35\%$ of the latent state trajectories yield identical expressions for GPT-2, and $20\%$ for LLaMA-2-7B. For Type 2, the largest indistinguishable set comprises $70\%$ and $15\%$ of the latent state trajectories for GPT-2 and LLaMA-2-7B respectively. For visualization purposes, we shift $Q_{\tau}$ by +0.5 units on the y-axis, since the graphs for $Q_{\tau}$ and $R_{\tau}$ otherwise overlap.
Figure 2: One-Step Fading and Infinite Fading Memory Model on GPT-2 (left) and LLaMA-2-7B (right). For the former, the largest size of the indistinguishable set comprises around 80% and 30% of hidden state trajectories for GPT-2 and LLaMA-2-7B respectively. Note that $Q_1(p) = 1$ since the memory mechanism only kicks in at the first timestep. For visualization purposes we perturb $Q_{\tau}$ by +0.1 units on the y-axis, lest the graphs of $R_{\tau}$ and $Q_{\tau}$ overlap.
Figure 3: In this experiment on GPT-2, we compute $R_{\tau}(p)$ with the Type 1 model on a 1000-element subset of $\mathcal{A}$, for various possible adversarial choices of $p$. The optimized adversarial prompt (blue) is constructed via Eq. \ref{eqn:adversarial-optim} with $n =\tau=1$. Even though $\tau=1$ only maximizes the KL divergence of the token right after the adversarial prompt, this method of approximating $p^\ast$ dominates all other handcrafted choices. Further inspection (right, log-scale) reveals that the model is still not observable.
Figure 4: We apply the adversarial prompt, optimized via Eq. \ref{eqn:adversarial-optim} to distinguish between Type 1 system prompts, and show that it generalizes zero-shot to Type 2 models (left), dominating all the handcrafted adversarial choices that we consider. However, such prompts do not generalize well to Type 3 (middle) or Type 4 (right) models.
Figure 5: Toy example of a Trojan horse. LLMs have been shown to be vulnerable to backdoor attacks. The short-term observability analysis conducted in this paper may be useful for understanding potential misuse of LLMs. Real-world examples optimized on existing pre-trained LLMs can be found in Tab. \ref{tab:trojan-horse-optimize}.

Theorems & Definitions (7)

Theorem 1
Corollary 1
Theorem 2
Corollary 2
proof
proof
proof
