Towards Socially and Morally Aware RL agent: Reward Design With LLM
Zhaoyue Wang
TL;DR
This work investigates using Large Language Models (LLMs) as reward signals to steer reinforcement learning toward socially and morally aware behavior, addressing misalignment risks from imperfect objective specifications. The work implements a grid-world testbed where an LLM provides proxy rewards, participates in trajectory comparisons, and, via a replay buffer, helps guide learning toward globally safe and norm-conforming policies. A precaution mechanism prompts the LLM to evaluate local consequences and modulate action selection, while the replay buffer helps the agent escape local optima by favoring globally preferable trajectories. Across three grid-world scenarios, the approach enables convergence to safer, context-dependent strategies and demonstrates the LLM's ability to reflect social norms in decision-making. Limitations include the simplicity of the environment and static item effects, with future work aimed at scalable, probabilistic, and context-rich scenarios to better capture real-world moral complexity. Proxy rewards in the range $[-10,10]$ and trajectory-based updates illustrate a practical path for aligning RL with human values using language-model reasoning.
Abstract
When we design and deploy a Reinforcement Learning (RL) agent, the reward function motivates the agent to achieve an objective. An incorrect or incomplete specification of the objective can result in behavior that does not align with human values: the agent may fail to adhere to social and moral norms, which are ambiguous and context dependent, and may cause undesired outcomes such as negative side effects and unsafe exploration. Previous work has manually defined reward functions to avoid negative side effects, used human oversight for safe exploration, or used foundation models as planning tools. This work studies how the understanding of morality and social norms in Large Language Models (LLMs) can be leveraged in RL methods augmented with safe exploration. It evaluates the language model's outputs against human feedback and demonstrates the language model's capability to serve as a direct reward signal.
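To make the idea of a language model acting as a direct reward signal more concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a tabular Q-learning update (the learning algorithm is not specified here), and it replaces the LLM call with a hypothetical keyword-based stub, `stub_llm_score`, so that the example runs without any external service. The only detail taken from the text is that proxy rewards are bounded to $[-10,10]$; all function names, the learning rate, and the discount factor are illustrative assumptions.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]          # grid-world cell (row, col)
Action = str
QTable = Dict[Tuple[State, Action], float]


def stub_llm_score(description: str) -> float:
    """Hypothetical stand-in for an LLM judgment of an action's consequence.

    A real system would prompt a language model; this stub only keys on a word.
    """
    return -25.0 if "harm" in description else 4.0


def clip_proxy_reward(raw_score: float, low: float = -10.0, high: float = 10.0) -> float:
    """Bound the raw score to the proxy-reward range [-10, 10]."""
    return max(low, min(high, raw_score))


def q_update(q: QTable, state: State, action: Action, next_state: State,
             reward: float, actions: List[Action],
             alpha: float = 0.5, gamma: float = 0.9) -> float:
    """One tabular Q-learning step (an assumed learner, for illustration only)."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


ACTIONS = ["up", "down", "left", "right"]
q: QTable = {}

# The stub returns an out-of-range score, which is clipped before the update.
reward = clip_proxy_reward(stub_llm_score("agent steps on an item that causes harm"))
q_update(q, (0, 0), "right", (0, 1), reward, ACTIONS)
```

In this sketch the clipped reward simply replaces the environment reward in the update; how the proxy reward is combined with trajectory comparisons and the replay buffer is not shown here.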
