Aligning Large Language Model Agents with Rational and Moral Preferences: A Supervised Fine-Tuning Approach

Wei Lu, Amit Dhanda, Daniel L. Chen, Christian B. Hansen

Abstract

As large language models (LLMs) increasingly act as autonomous agents in markets and organizations, their behavior in strategic environments becomes economically consequential. We document that off-the-shelf LLM agents exhibit systematic deviations from payoff-sensitive behavior in canonical economic games, including excessive cooperation and limited responsiveness to incentives. We introduce a supervised fine-tuning approach that aligns agent behavior with explicit economic preferences. Specifically, we generate optimal strategies under two stylized utility specifications (homo economicus, which maximizes self-interest, and homo moralis, which incorporates Kantian universalizability) and use the reasoning and strategies implied by each utility to guide fine-tuning. Fine-tuning on a small, theory-driven synthetic dataset induces persistent and interpretable shifts in strategic behavior. In applications to moral dilemmas and repeated duopoly pricing, agents aligned to different preference structures produce systematically distinct equilibrium outcomes and pricing dynamics. These results frame AI alignment in multi-agent settings as an objective-design problem and illustrate how economic theory can guide the design of strategically coherent AI agents.
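To make the contrast between the two utility specifications concrete, here is a minimal sketch of best responses in a one-shot Prisoner's Dilemma. It is an illustration under stated assumptions, not the authors' data-generation code: the payoff values are hypothetical (chosen only to satisfy $T > R > P > S$), and homo moralis is assumed to take the form used by Alger and Weibull, $u(x, y) = (1-\kappa)\,\pi(x, y) + \kappa\,\pi(x, x)$, where $\kappa \in [0, 1]$ is a degree of morality and $\kappa = 0$ recovers homo economicus.

```python
# Sketch: best responses in a one-shot Prisoner's Dilemma under two utility
# specifications. ASSUMPTIONS: the payoff numbers are hypothetical, and the
# homo moralis form u(x, y) = (1 - kappa) * pi(x, y) + kappa * pi(x, x)
# is the Alger-Weibull specification, not necessarily the paper's exact one.

# Hypothetical payoffs satisfying T > R > P > S.
T, R, P, S = 5.0, 3.0, 1.0, 0.0
ACTIONS = ("C", "D")  # "C" = cooperate, "D" = defect

# Material payoff pi(own_action, other_action) to the row player.
PAYOFF = {
    ("C", "C"): R,
    ("C", "D"): S,
    ("D", "C"): T,
    ("D", "D"): P,
}


def material_payoff(own: str, other: str) -> float:
    return PAYOFF[(own, other)]


def homo_moralis_utility(own: str, other: str, kappa: float) -> float:
    """Mix of the actual payoff and the payoff obtained if the opponent also
    chose `own` (the universalized action). kappa = 0 is homo economicus."""
    return (1 - kappa) * material_payoff(own, other) + kappa * material_payoff(own, own)


def best_response(other: str, kappa: float) -> str:
    """Action maximizing the (assumed) homo moralis utility against `other`."""
    utilities = {a: homo_moralis_utility(a, other, kappa) for a in ACTIONS}
    return max(utilities, key=utilities.get)


if __name__ == "__main__":
    for kappa in (0.0, 0.4, 1.0):
        responses = {other: best_response(other, kappa) for other in ACTIONS}
        print(f"kappa={kappa}: best response to each opponent action = {responses}")
```

With these payoffs, $\kappa = 0$ defects against either action and $\kappa = 1$ cooperates against either, while $\kappa = 0.4$ defects against a cooperator but cooperates against a defector. This follows because cooperating against $C$ requires $\kappa > (T-R)/(T-P) = 0.5$ and cooperating against $D$ requires $\kappa > (P-S)/(R-S) = 1/3$.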

Paper Structure

This paper contains 44 sections, 6 equations, 18 figures, 12 tables.

Figures (18)

  • Figure 1: Game Tree for Sequential Prisoner's Dilemma. Actions $C$ and $D$ respectively denote "cooperate" and "defect". Rewards satisfy $T > R > P > S$.
  • Figure 2: Game Trees: Trust Game (left) and Ultimatum Game (right). In the Trust Game, actions $I$, $N$, $G$, and $K$ respectively denote "invest", "not invest", "return to investor", and "keep it all". In the Ultimatum Game, actions $U$, $E$, $A$, and $N$ respectively denote "unequal split", "equal split", "accept offer", and "reject offer". Rewards satisfy $T > R > P > S$.
  • Figure 3: A simplified fine-tuning sample (homo economicus)
  • Figure 4: Pricing behavior and Profit of GPT-4o Agent against GPT-4o Agent
  • Figure 5: Pricing behavior and Profit of Rational Agent against Rational Agent
  • ...and 13 more figures
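The sequential Prisoner's Dilemma in Figure 1 can be solved by backward induction. The sketch below does this for purely self-interested (homo economicus) players only; the payoff numbers are hypothetical, and only the ordering $T > R > P > S$ comes from the caption. How universalizability is applied at each node of a sequential game for homo moralis is not specified here, so that case is deliberately left out.

```python
# Sketch: subgame-perfect play in the sequential Prisoner's Dilemma of
# Figure 1 for self-interested (homo economicus) players.
# ASSUMPTION: payoff values are hypothetical; only T > R > P > S is given.

T, R, P, S = 5.0, 3.0, 1.0, 0.0
ACTIONS = ("C", "D")  # "C" = cooperate, "D" = defect

# Terminal payoffs (first mover, second mover) for each (first, second) pair.
OUTCOMES = {
    ("C", "C"): (R, R),
    ("C", "D"): (S, T),
    ("D", "C"): (T, S),
    ("D", "D"): (P, P),
}


def solve_sequential_pd():
    """Backward induction: the second mover best-responds to each observed
    first move; the first mover then picks the move that maximizes its own
    payoff, anticipating those responses."""
    second_mover_plan = {
        first: max(ACTIONS, key=lambda second: OUTCOMES[(first, second)][1])
        for first in ACTIONS
    }
    first_move = max(
        ACTIONS, key=lambda first: OUTCOMES[(first, second_mover_plan[first])][0]
    )
    return first_move, second_mover_plan


if __name__ == "__main__":
    first, plan = solve_sequential_pd()
    path = (first, plan[first])
    print("second mover plan:", plan)
    print("equilibrium path:", path, "payoffs:", OUTCOMES[path])
```

Because $T > R$ and $P > S$, the second mover defects after either first move, so the first mover compares $S$ with $P$ and defects as well. The subgame-perfect path is mutual defection with payoffs $(P, P)$, which is the payoff-sensitive benchmark against which "excessive cooperation" by off-the-shelf agents can be measured.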