An Affective-Taxis Hypothesis for Alignment and Interpretability

Eli Sennesh; Maxwell Ramstead

TL;DR

The paper addresses AI alignment by arguing that human evaluative states arise from interoceptive, affective processes and cannot be captured by reward functions inferred solely from behavior. It proposes the affective-taxis hypothesis, linking affective valence to gradient-guided navigation in an internal interoceptive space, and situates this within evolutionary neuroscience and normative modeling via active inference and energy-based representations. The authors develop a computational framing that models affective landscapes as gradient-biased processes and discuss testing these ideas in a tractable model organism, C. elegans, where taxis can be studied without temporal associative learning. They delineate concrete research directions, including POMDP-like formulations, directional derivatives of attractant gradients, and inverse RL approaches to recover surface structures, aiming to enhance interpretability and alignment by grounding AI in biologically plausible affective dynamics. The work offers a biologically grounded, interpretable pathway toward aligning AI with human evaluative systems and lays out a roadmap for empirical validation and future extension toward human-level affective cognition.

Abstract

AI alignment is a field of research that aims to develop methods to ensure that agents always behave in a manner aligned with (i.e. consistently with) the goals and values of their human operators, no matter their level of capability. This paper proposes an affectivist approach to the alignment problem, re-framing the concepts of goals and values in terms of affective taxis, and explaining the emergence of affective valence by appealing to recent work in evolutionary-developmental and computational neuroscience. We review the state of the art and, building on this work, we propose a computational model of affect based on taxis navigation. We discuss evidence in a tractable model organism that our model reflects aspects of biological taxis navigation. We conclude with a discussion of the role of affective taxis in AI alignment.
