Towards Measuring Goal-Directedness in AI Systems

Dylan Xu, Juan-Pablo Rivera

TL;DR

We propose a definition of goal-directedness that is simpler and easier to compute, as a step toward answering whether AI systems could pursue dangerous goals.

Abstract

Recent advances in deep learning have drawn attention to the possibility of creating advanced, general AI systems that outperform humans across many tasks. If these systems pursue unintended goals, however, the consequences could be catastrophic. A key prerequisite for an AI system pursuing unintended goals is that it behaves in a coherent, goal-directed manner in the first place, optimizing for some unknown goal; a significant body of research tries to evaluate systems for such behavior. However, the most rigorous existing definitions of goal-directedness are difficult to compute in real-world settings. Drawing on this literature, we explore policy goal-directedness within reinforcement learning (RL) environments. We propose a different family of definitions of the goal-directedness of a policy, which analyze whether the policy is well-modeled as near-optimal for many (sparse) reward functions. We operationalize this preliminary definition and test it in toy Markov decision process (MDP) environments. Furthermore, we explore how goal-directedness could be measured in frontier large language models (LLMs). Our contribution is a definition of goal-directedness that is simpler and easier to compute, as a step toward answering whether AI systems could pursue dangerous goals. Based on our findings, we recommend further exploration of measuring coherence and goal-directedness.

Paper Structure (38 sections, 14 equations, 14 figures)


Figures (14)

  • Figure 1: Accuracy and loss when passing in three different sets of features into the logistic classifier for predicting $\frac{P(\pi = \pi_0 | URS)}{P(\pi = \pi_0 | UPS)}$. Columns: (P) = policy (plus the flattened transition matrix and discount rate, although in practice it does not make a difference); (LL) = distance to loop and length of loop for the policy at each starting state $\pi(s_0)$; (O, S) = the sum of out-arrows, or states reachable from $s_0$, and whether $\pi(s_0)=s_0$ is a self-loop for any $s_0 \in S$. All error bars in the MDP experiments assume a normal distribution and show the two-sigma error of the independent generation of 30 classifiers for each category. All error bars in the MDP and RL experiments were calculated with calls to NumPy functions.
  • Figure 2: Ablations for the LLM experiments. We train a classifier here on activations from two versions of llama-2-7b fine-tuned on GSM8k with sparse loss using one random token in the sequence and dense loss using all tokens respectively. We then tested the classifier on activations from models fine-tuned on a different dataset (Orca math) with sparse loss on one token, sparse loss on ten tokens, and dense loss. The classifier correctly separates these cases and places the denser ten-token-trained models between the 1-token and dense models. The right figure is generated similarly, except the sparsely-trained model creating data points for the classifier is trained on ten random tokens in the sequence. The classifier correctly separates these cases and identifies activations from models trained on one random token per sequence (which was not in the training distribution) as more sparse than ten-token-trained models. Rerunning the experiments with a different seed produces negligible variation in results.
  • Figure 3: Accuracy and loss when passing in three different sets of features into the logistic classifier for predicting $\frac{P(\pi = \pi_0 | USS)}{P(\pi = \pi_0 | UPS)}$.
  • Figure 4: Accuracy and loss when passing in three different sets of features into the logistic classifier for predicting $\frac{P(\pi = \pi_0 | USS)}{P(\pi = \pi_0 | URS)}$.
  • Figure 5: Bigger version of Figure \ref{fig:main1}
  • ...and 9 more figures
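
The (LL) and (O, S) feature sets named in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a deterministic policy in a deterministic MDP, represented as a hypothetical array `successor` where `successor[s]` is the state the policy moves to from `s`. The function names and this representation are illustrative assumptions, not the authors' implementation.

```python
def loop_features(successor):
    """For each start state, return (distance_to_loop, loop_length).

    Assumption: `successor[s]` is the next state reached from `s` under a
    deterministic policy with deterministic transitions.
    """
    features = []
    for start in range(len(successor)):
        first_seen = {}  # state -> step at which it was first visited
        state, step = start, 0
        # Follow the policy until a state repeats; that state is on the loop.
        while state not in first_seen:
            first_seen[state] = step
            state = successor[state]
            step += 1
        distance = first_seen[state]        # steps taken before entering the loop
        length = step - first_seen[state]   # number of states in the loop
        features.append((distance, length))
    return features


def is_self_loop(successor):
    """For each state s, report whether the policy stays put (successor[s] == s)."""
    return [successor[s] == s for s in range(len(successor))]


# Hypothetical 4-state example: 0 -> 1 -> 2 -> 1 (a 2-cycle), and 3 -> 3.
example = [1, 2, 1, 3]
print(loop_features(example))
print(is_self_loop(example))
```

In this toy example, state 0 is one step away from a loop of length two, while state 3 is a self-loop, so its distance is zero and its loop length is one.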

Theorems & Definitions (1)

  • Definition A.1