Explanation Design in Strategic Learning: Sufficient Explanations that Induce Non-harmful Responses
Kiet Q. H. Vo, Siu Lun Chau, Masahiro Kato, Yixin Wang, Krikamol Muandet
TL;DR
This work tackles safe explanation design when full disclosure of a decision maker's (DM's) predictive model is infeasible. It introduces ARexes (action-recommendation-based explanations) and establishes a necessary condition for surrogate explanations to avoid harming agents. Under a conditional-homogeneity assumption, it further proves that ARexes are sufficient to induce non-harmful responses. The authors provide a practical Joint-Opt learning procedure that jointly optimises the predictive model $g$ and the ARex policy $\sigma$, and demonstrate it on synthetic data and the German credit dataset, showing improved predictive performance while preserving agent welfare. The results establish a principled framework for safe partial model disclosure in strategic settings and point to extensions to dynamic environments and richer agent models.
Abstract
We study explanation design in algorithmic decision making with strategic agents, i.e., individuals who may modify their inputs in response to explanations of a decision maker's (DM's) predictive model. As the demand for transparent algorithmic systems continues to grow, most prior work assumes full model disclosure as the default solution. In practice, however, DMs such as financial institutions typically disclose only partial model information via explanations. Such partial disclosure can lead agents to misinterpret the model and take actions that unknowingly harm their utility. A key open question is how DMs can communicate explanations in a way that avoids harming strategic agents while still supporting their own decision-making goals, e.g., minimising predictive error. In this work, we analyse well-known explanation methods and establish a necessary condition to prevent explanations from misleading agents into self-harming actions. Moreover, under a conditional homogeneity assumption, we prove that action-recommendation-based explanations (ARexes) are sufficient for non-harmful responses, mirroring the revelation principle in information design. To demonstrate how ARexes can be operationalised in practice, we propose a simple learning procedure that jointly optimises the predictive model and explanation policy. Experiments on synthetic and real-world tasks show that ARexes allow the DM to optimise their model's predictive performance while preserving agents' utility, offering a more refined strategy for safe and effective partial model disclosure.
