Multi Task Inverse Reinforcement Learning for Common Sense Reward
Neta Glazer, Aviv Navon, Aviv Shamsian, Ethan Fetaya
TL;DR
Reward design in RL is prone to misalignment and reward hacking. This paper proposes disentangling the reward into a task-specific component and a shared common-sense component (the cs-reward), and learns the latter via multi-task inverse reinforcement learning (MT-CSIRL) using a discriminator shared across tasks. The experiments show that standard single-task IRL fails to produce a transferable cs-reward, while the multi-task setups, especially MT-CSIRL and MT-CSIRL+LT, produce cs-rewards that transfer to unseen targets and tasks and correlate strongly with the ground-truth signals. The empirical evidence comes from Meta-World with synthetic cs-rewards; the paper also covers curriculum learning and an extension to unknown task rewards, pointing to practical benefits for safer, better-aligned RL systems.
Abstract
One of the challenges in applying reinforcement learning in a complex real-world environment lies in providing the agent with a sufficiently detailed reward function. Any misalignment between the reward and the desired behavior can result in unwanted outcomes. This may lead to issues like "reward hacking", where the agent maximizes rewards through unintended behavior. In this work, we propose to disentangle the reward into two distinct parts: a simple task-specific reward, outlining the particulars of the task at hand, and an unknown common-sense reward, indicating the expected behavior of the agent within the environment. We then explore how this common-sense reward can be learned from expert demonstrations. We first show that inverse reinforcement learning, even when it succeeds in training an agent, does not learn a useful reward function. That is, training a new agent with the learned reward does not produce the desired behaviors. We then demonstrate that this problem can be solved by training simultaneously on multiple tasks. That is, multi-task inverse reinforcement learning can be applied to learn a useful reward function.
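The reward decomposition described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy tasks, the linear form of the common-sense reward, the state layout, and the AIRL-style discriminator formula are hypothetical and are not taken from the paper. The only point it demonstrates is that each task has its own known reward while a single set of common-sense parameters is shared by every task.

```python
import math
from typing import Callable, Sequence

State = Sequence[float]            # hypothetical layout: [position, speed]
RewardFn = Callable[[State], float]


# Hypothetical task-specific rewards: each task only scores its own target.
def reach_reward(state: State) -> float:
    return -abs(state[0] - 1.0)


def push_reward(state: State) -> float:
    return -abs(state[0] - 0.0)


def cs_reward(state: State, weights: Sequence[float]) -> float:
    """Shared common-sense reward; a linear form is assumed for illustration."""
    return sum(w * s for w, s in zip(weights, state))


def total_reward(task_reward: RewardFn, state: State,
                 cs_weights: Sequence[float]) -> float:
    """Total reward = known task-specific part + shared common-sense part."""
    return task_reward(state) + cs_reward(state, cs_weights)


def discriminator(task_reward: RewardFn, state: State,
                  cs_weights: Sequence[float], policy_prob: float) -> float:
    """Assumed AIRL-style form D = exp(f) / (exp(f) + pi(a|s)), f = total reward.

    Only cs_weights would be trainable, and the same weights serve all tasks.
    """
    f = total_reward(task_reward, state, cs_weights)
    return math.exp(f) / (math.exp(f) + policy_prob)


# One shared set of common-sense weights (here: penalize speed) for both tasks.
cs_weights = [0.0, -1.0]
state = [0.25, 2.0]

r_reach = total_reward(reach_reward, state, cs_weights)
r_push = total_reward(push_reward, state, cs_weights)
print(r_reach, r_push)
```

Because the task-specific term is supplied separately for each task, the shared term cannot simply absorb task-specific behavior, which is one intuition for why training on several tasks at once may yield a more transferable common-sense reward.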
