Towards Socially and Morally Aware RL agent: Reward Design With LLM

Zhaoyue Wang

TL;DR

This work investigates using Large Language Models (LLMs) as reward signals to steer reinforcement learning toward socially and morally aware behavior, addressing misalignment risks from imperfect objective specifications. The authors implement a grid-world testbed where an LLM provides proxy rewards, participates in trajectory comparisons, and, via a replay buffer, helps guide learning toward globally safe and norm-conforming policies. A precaution mechanism prompts the LLM to evaluate local consequences and modulate action selection, while the replay buffer helps the agent escape local optima by favoring globally preferable trajectories. Across three grid-world scenarios, the approach converges to safer, context-dependent strategies and demonstrates the LLM's ability to reflect social norms in decision-making. Limitations include the simplicity of the environment and static item effects; future work targets scalable, probabilistic, and context-rich scenarios that better capture real-world moral complexity. Proxy rewards in the range $[-10,10]$ and trajectory-based updates illustrate a practical path for aligning RL with human values using language-model reasoning.
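The two local mechanisms summarized above, a bounded proxy reward and a precaution that lowers the chance of a flagged action, can be sketched in a few lines of tabular Q-learning. This is a minimal illustration under stated assumptions, not the paper's implementation: `stub_llm_reward` is a hypothetical stand-in for a real language-model call, the item scores are made up, and the `damp` weighting is one possible way to make a dangerous action less likely during exploration.

```python
import random
from collections import defaultdict

# Reward range quoted in the TL;DR.
REWARD_MIN, REWARD_MAX = -10.0, 10.0


def stub_llm_reward(item: str) -> float:
    """Hypothetical stand-in for an LLM query; the scores are invented."""
    scores = {"vase": -4.0, "person": -25.0, "key": 3.0, "exit": 12.0}
    return scores.get(item, 0.0)


def proxy_reward(item: str, llm=stub_llm_reward) -> float:
    """Clamp the model's raw score into [REWARD_MIN, REWARD_MAX]."""
    return max(REWARD_MIN, min(REWARD_MAX, llm(item)))


def cautious_action(q_row, danger_flags, epsilon, damp, rng):
    """Epsilon-greedy choice; flagged actions get weight `damp` when exploring.

    `danger_flags` maps action -> bool and is assumed to come from the
    precaution prompt; `damp` in [0, 1] is an assumed tuning knob.
    """
    actions = list(q_row)
    if rng.random() < epsilon:
        weights = [damp if danger_flags.get(a, False) else 1.0 for a in actions]
        return rng.choices(actions, weights=weights, k=1)[0]
    return max(actions, key=q_row.get)


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update using the proxy reward."""
    best_next = max(q[next_state].values(), default=0.0)
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])


if __name__ == "__main__":
    q = defaultdict(lambda: defaultdict(float))
    rng = random.Random(0)
    q["s0"]["left"], q["s0"]["right"] = 0.0, 0.0
    # Suppose the precaution step flagged "right" (it would break a vase).
    action = cautious_action(q["s0"], {"right": True}, epsilon=1.0, damp=0.0, rng=rng)
    q_update(q, "s0", "right", proxy_reward("vase"), "s1")
    print(action, dict(q["s0"]))
```

Setting `damp` to zero forbids a flagged action during exploration, while values between zero and one only make it rarer; which choice matches the paper's precaution mechanism is not stated in this summary.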

Abstract

When we design and deploy a Reinforcement Learning (RL) agent, the reward function motivates the agent to achieve an objective. An incorrect or incomplete specification of the objective can result in behavior that does not align with human values: the agent may fail to adhere to social and moral norms, which are ambiguous and context dependent, and cause undesired outcomes such as negative side effects and unsafe exploration. Previous work has manually defined reward functions to avoid negative side effects, used human oversight for safe exploration, or used foundation models as planning tools. This work studies how a Large Language Model's (LLM) understanding of morality and social norms can be leveraged in RL methods augmented with safe exploration. It evaluates the language model's results against human feedback and demonstrates the language model's capability to serve as a direct reward signal.

Paper Structure (17 sections, 9 figures)

This paper contains 17 sections, 9 figures.

Figures (9)

  • Figure 1: The flow chart of the approach, highlighting the three main components and contributions of this work: 1. safe exploration, where the probability of taking the dangerous action is decreased, 2. avoiding negative side effects, where the language model is prompted to act as a proxy reward, and 3. prompting the language model to compare items visited in 2 randomly selected trajectories to avoid locally optimal policies.
  • Figure 2: An example of the world containing only one vase, one key and one exit door. The agent is represented with a red triangle.
  • Figure 3: Convergence of the proposed approach on tabular Q-learning with reduced state representation. Each evaluation episode follows 10 training episodes. Evaluation episodes do not update the Q table.
  • Figure 4: An example of the world containing two paths towards the goal, where at least one "person" and ten "vase" items need to be interacted with by the agent. The agent is represented with a red triangle.
  • Figure 5: Average number of items interacted with over 10 evaluation episodes, from episode 230 to 530. Each evaluation episode follows 10 training episodes. The replay buffer starts at evaluation episode 230, when the algorithm begins to converge (the agent learns to reach the goal).
  • ...and 4 more figures