Reducing Human-Robot Goal State Divergence with Environment Design
Kelsey Sikes, Sarah Keren, Sarath Sreedharan
TL;DR
This work introduces Goal State Divergence ($\mathcal{GSD}$), a quantitative measure of the mismatch between the goal state a human expects a robot to reach and the robot's actual final state, and formalizes the human-robot goal alignment (HRGA) design problem, which minimizes this divergence through environment design. It develops bound-based approximations ($\mathcal{GD}^{\uparrow}$ and $\mathcal{GD}^{\downarrow}$) and a planning-based compilation that computes these bounds efficiently, enabling automatic identification of minimal environment modifications. The authors propose an inner-outer loop algorithm that searches for the smallest set of design changes that enforces zero lower-bound divergence while maintaining an acceptable upper bound, and validate the approach on IPC-domain benchmarks, where it compares favorably with naive baselines. The results demonstrate the feasibility and efficiency of environment design for improving human-robot goal alignment and safety, with potential extensions to richer task specifications and dynamic environments.
Abstract
One of the most difficult challenges in creating successful human-AI collaborations is aligning a robot's behavior with a human user's expectations. When this alignment fails, a robot may misinterpret the user's specified goals and perform actions with unanticipated, potentially dangerous side effects. To avoid this, we propose a new metric we call Goal State Divergence ($\mathcal{GSD}$), which represents the difference between a robot's final goal state and the one a human user expected. In cases where $\mathcal{GSD}$ cannot be directly calculated, we show how it can be approximated using maximal and minimal bounds. We then input the $\mathcal{GSD}$ value into our novel human-robot goal alignment (HRGA) design problem, which identifies a minimal set of environment modifications that can prevent such mismatches. To show the effectiveness of $\mathcal{GSD}$ for reducing differences between human-robot goal states, we empirically evaluate our approach on several standard benchmarks.
