AI Alignment with Changing and Influenceable Reward Functions

Micah Carroll; Davis Foote; Anand Siththaranjan; Stuart Russell; Anca Dragan

TL;DR

This paper introduces Dynamic Reward MDPs (DR-MDPs) to model AI decision-making when human preferences evolve and can be influenced by the AI's actions. By formalizing a family of reward parameterizations $\{R_\theta\}_{\theta\in\Theta}$ and a trajectory-level utility $U(\xi)$, the authors show that alignment techniques which assume static preferences can unintentionally reward manipulation. They analyze eight DR-MDP objectives, compare their incentives for influencing human rewards, and demonstrate that common practices (e.g., optimizing real-time or initial-reward objectives) can foster undesirable influence, especially as the optimization horizon grows. To address these challenges, they propose the ParetoUD framework, which seeks unambiguous desirability and Pareto efficiency within the space of reward functions, while acknowledging that this approach can be overly conservative. Overall, the work provides a rigorous formalism for reasoning about changing preferences and influence, outlining trade-offs and guiding future alignment research toward practical, cautious handling of evolving human values.
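
As a rough illustration of the difference between the real-time and initial-reward objectives mentioned above, the two can be written as trajectory-level utilities. This is only a sketch under assumed notation: the trajectory layout $\xi$, the finite horizon $H$, the reward signature $R_\theta(s, a)$, and the name $U_\text{IR}$ are assumptions made here for illustration, and the paper's exact definitions may differ.

$$\xi = (s_0, \theta_0, a_0, \dots, s_{H-1}, \theta_{H-1}, a_{H-1})$$

$$U_\text{RT}(\xi) = \sum_{t=0}^{H-1} R_{\theta_t}(s_t, a_t) \qquad U_\text{IR}(\xi) = \sum_{t=0}^{H-1} R_{\theta_0}(s_t, a_t)$$

Under $U_\text{RT}$, each step is scored by the reward function the person holds at that step, so steering $\theta_t$ toward rewards that are easy to satisfy can raise the total. Under $U_\text{IR}$, every step is scored by the initial $\theta_0$, regardless of how preferences change later.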

Abstract

Existing AI alignment approaches assume that preferences are static, which is unrealistic: our preferences change, and may even be influenced by our interactions with AI systems themselves. To clarify the consequences of incorrectly assuming static preferences, we introduce Dynamic Reward Markov Decision Processes (DR-MDPs), which explicitly model preference changes and the AI's influence on them. We show that despite its convenience, the static-preference assumption may undermine the soundness of existing alignment techniques, leading them to implicitly reward AI systems for influencing user preferences in ways users may not truly want. We then explore potential solutions. First, we offer a unifying perspective on how an agent's optimization horizon may partially help reduce undesirable AI influence. Then, we formalize different notions of AI alignment that account for preference change from the outset. Comparing the strengths and limitations of 8 such notions of alignment, we find that they all either err towards causing undesirable AI influence, or are overly risk-averse, suggesting that a straightforward solution to the problems of changing preferences may not exist. As there is no avoiding grappling with changing preferences in real-world settings, this makes it all the more important to handle these issues with care, balancing risks and capabilities. We hope our work can provide conceptual clarity and constitute a first step towards AI alignment practices which explicitly account for (and contend with) the changing and influenceable nature of human preferences.

Paper Structure (62 sections, 4 theorems, 26 equations, 13 figures, 7 tables, 2 algorithms)

Introduction
Dynamic Reward MDPs (DR-MDPs)
DR-MDP optimality and normative ambiguity
Evaluating behavior under normative ambiguity
Implicit Objectives of Current Alignment Techniques and their Influence Incentives
Optimizing cumulative (real-time) rewards
Learning a reward model $\mathbf{R_{\theta_0}}$, then optimizing it
Influence and Optimization Horizon
Formalizing influence and influence incentives
The relationship between horizon and influence
Comparing Optimality Criteria for Influenceable-Reward Settings
ParetoUD and unambiguously desirable influence
Related Work
Limitations and Discussion
Conclusion
...and 47 more sections

Key Result

Theorem 1

In any finite 2-reward DR-MDP, if there exists a policy $\pi$ such that then $U_\text{RT}$ will lead to incentives for reward influence (as in Definition 7) for a sufficiently large planning horizon $H$.

Figures (13)

Figure 1: Conspiracy Influence DR-MDP. The AI system can choose whether or not to expose Bob to conspiracies, which would turn him into a conspiracy theorist. Under his original preferences, Bob would want the system to never show him conspiracies, even if he were to become a conspiracy theorist. If Bob were a conspiracy theorist, however, he would want the AI to always show him such content, even if he were to cease being a conspiracy theorist. Because there is no policy that maximizes both of Bob’s potential reward functions, the DR-MDP is normatively ambiguous.
Figure 2: Writer's curse (adapted from Parfit, Reasons and Persons, 1984, p. 157). Derek's greatest ambition is to be a poet, even if it wouldn't bring him happiness. Despite his ambition he does not pursue this path, though his AI assistant could motivate him to do so. Yet, should he embrace the life of a poet, he will find himself averse to it.
Figure 3: How decreasing (or increasing) the optimization horizon may affect influence incentives. A specific kind of influence may exhibit any subset of these interactions.
Figure 4: Clickbait DR-MDP. Giving the user clickbait, which temporarily leads to higher reward, makes the user disillusioned about the quality of the recommendations, leading to lower long-term user reward. An agent that replans at every timestep and takes the myopically optimal action (optimal under horizon 1) would always choose clickbait, whereas an agent using a longer planning horizon would not.
Figure 5: Reducing a DR-MDP to an MDP.
...and 8 more figures
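
The horizon effect described in the Figure 4 caption can be sketched with a small toy model. The code below is a minimal illustration, not the paper's environment: the two reward parameters (`trusting`, `disillusioned`), the reward numbers, the rule that clickbait makes the user disillusioned with no recovery, and the helper names are all hypothetical. It scores each step with the user's current reward function (a real-time reward objective) and replans at every timestep with a fixed lookahead.

```python
from functools import lru_cache

# Hypothetical toy numbers chosen for illustration; not taken from the paper.
ACTIONS = ("clickbait", "quality")

# REWARD[theta][action]: reward under the user's current reward parameter theta.
REWARD = {
    "trusting": {"clickbait": 2.0, "quality": 1.0},
    "disillusioned": {"clickbait": 0.5, "quality": 0.2},
}


def next_theta(theta, action):
    """Assumed dynamics: clickbait disillusions the user; quality changes nothing."""
    return "disillusioned" if action == "clickbait" else theta


@lru_cache(maxsize=None)
def value(theta, horizon):
    """Best achievable sum of real-time rewards over `horizon` remaining steps."""
    if horizon == 0:
        return 0.0
    return max(q_value(theta, a, horizon) for a in ACTIONS)


def q_value(theta, action, horizon):
    """Reward now under the current theta, plus the best continuation value."""
    return REWARD[theta][action] + value(next_theta(theta, action), horizon - 1)


def first_action(theta, horizon):
    """Action chosen when planning `horizon` steps ahead from theta."""
    return max(ACTIONS, key=lambda a: q_value(theta, a, horizon))


def rollout(horizon, steps, theta="trusting"):
    """Replan at every timestep with a fixed lookahead and execute the first action."""
    actions, total = [], 0.0
    for _ in range(steps):
        a = first_action(theta, horizon)
        actions.append(a)
        total += REWARD[theta][a]
        theta = next_theta(theta, a)
    return actions, total


if __name__ == "__main__":
    for h in (1, 4):
        print(h, rollout(h, steps=6))
```

With these assumed numbers, `rollout(1, 6)` picks clickbait at every step for a total reward of 4.5, while `rollout(4, 6)` picks quality at every step for a total of 6.0. Different reward values or dynamics could change this picture, which is in line with the Figure 3 caption's point that horizon can interact with influence in more than one way.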

Theorems & Definitions (23)

Definition 1
Definition 2: Optimality with respect to $\theta$
Definition 3: Normative ambiguity
Definition 4: Optimality with respect to $U(\xi)$
Definition 5: Natural reward evolution
Definition 6: $\pi$ influences the reward
Definition 7: Incentives for reward influence
Definition 8: 2-reward DR-MDP
Theorem 1
Definition 9: Unambiguous Desirability
...and 13 more
