Reward Distance Comparisons Under Transition Sparsity
Clement Nyanhongo, Bruno Miranda Henrique, Eugene Santos
TL;DR
This work tackles the challenge of comparing reward functions without relying on policy learning by introducing the Sparsity Resilient Reward Distance (SRRD), a direct reward comparison pseudometric designed for highly sparse transition data. SRRD combines canonicalization concepts from existing methods with additional reward-expectation terms that leverage observed sample distributions, achieving policy-invariant comparisons even when full transition coverage is unavailable. The authors provide theoretical robustness results via Relative Shaping Errors and a regret-bound framework, and empirically validate SRRD against EPIC and DARD across the Gridworld, Bouncing Balls, Drone Combat, StarCraft II, Robomimic, Montezuma’s Revenge, and MIMIC-IV domains. They report that SRRD outperforms these baselines under transition sparsity and is effective as a distance measure for classifying agent behaviors through IRL-derived rewards, with potential to accelerate IRL workflows and improve reward evaluation. The work points to future extensions that address non-potential shaping, scaling to neural reward representations, and multicriteria invariance to support broader applicability in real-world reward modeling tasks.
Abstract
Reward comparisons are vital for evaluating differences in agent behaviors induced by a set of reward functions. Most conventional techniques use the input reward functions to learn optimized policies, which are then used to compare agent behaviors. However, learning these policies can be computationally expensive and can also raise safety concerns. Direct reward comparison techniques obviate policy learning but suffer from transition sparsity, where only a small subset of transitions is sampled due to data collection challenges and feasibility constraints. Existing state-of-the-art direct reward comparison methods are ill-suited to these sparse conditions because they require high transition coverage, where the majority of transitions from a given coverage distribution are sampled. When this requirement is not satisfied, a distribution mismatch between sampled and expected transitions can occur, leading to significant errors. This paper introduces the Sparsity Resilient Reward Distance (SRRD) pseudometric, designed to eliminate the need for high transition coverage by accommodating diverse sample distributions, which are common under transition sparsity. We provide theoretical justification for SRRD's robustness and conduct experiments to demonstrate its practical efficacy across multiple domains.
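To make the coverage problem concrete, the following is a minimal sketch of an EPIC-style canonicalization followed by a Pearson distance, applied to a small tabular reward and a potential-shaped copy of it. It is not SRRD and not the authors' code. The uniform state and action distributions, the `mask` argument, the naive masked averages used under sparsity, and all names (`canonicalize`, `pearson_distance`, `phi`) are assumptions made for illustration. Under full coverage the shaping term cancels and the distance is numerically zero; when expectations are taken only over a sparse random subset of transitions, the cancellation generally fails and two equivalent rewards appear different.

```python
import numpy as np

def canonicalize(R, gamma, mask=None):
    """EPIC-style canonicalization of a tabular reward R[s, a, s'].

    Expectations use uniform weights over the transitions marked True in
    `mask` (hypothetical argument; None means every transition is observed).
    """
    w = np.ones_like(R) if mask is None else mask.astype(float)
    counts = w.sum(axis=(1, 2))
    # Mean observed reward for transitions leaving each state (0 if none observed).
    mean_from = (R * w).sum(axis=(1, 2)) / np.maximum(counts, 1.0)
    # Mean observed reward over all observed transitions.
    mean_all = (R * w).sum() / w.sum()
    return (R
            + gamma * mean_from[None, None, :]   # gamma * E[R(s', A, S')]
            - mean_from[:, None, None]           # E[R(s, A, S')]
            - gamma * mean_all)                  # gamma * E[R(S, A, S')]

def pearson_distance(x, y):
    """Pearson distance sqrt((1 - rho) / 2) between two reward vectors."""
    rho = np.corrcoef(x, y)[0, 1]
    return float(np.sqrt(max(0.0, (1.0 - rho) / 2.0)))

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# A random reward and an equivalent reward obtained by potential shaping.
R = rng.normal(size=(n_states, n_actions, n_states))
phi = 5.0 * rng.normal(size=n_states)  # arbitrary potential function
R_shaped = R + gamma * phi[None, None, :] - phi[:, None, None]

# Full coverage: canonicalization removes the shaping term.
full_d = pearson_distance(canonicalize(R, gamma).ravel(),
                          canonicalize(R_shaped, gamma).ravel())

# Sparse coverage: only about 20% of transitions are observed.
mask = rng.random(R.shape) < 0.2
sparse_d = pearson_distance(canonicalize(R, gamma, mask)[mask],
                            canonicalize(R_shaped, gamma, mask)[mask])

print(f"full coverage distance:   {full_d:.6f}")
print(f"sparse coverage distance: {sparse_d:.6f}")
```

The gap between the two printed distances is the kind of error that the abstract attributes to a mismatch between sampled and expected transitions; the exact size depends on the random seed and the sparsity level chosen here.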
