Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards
Xinyi Yang, Liang Zeng, Heng Dong, Chao Yu, Xiaoran Wu, Huazhong Yang, Yu Wang, Milind Tambe, Tonghan Wang
TL;DR
This work tackles the challenge of producing model-agnostic natural-language explanations for agent decisions by training an Explanation LLM with rewards generated by a rectified flow model. The key idea is to derive per-sentence reward signals from how well a third-party auditor (the Guidance LLM) can infer the actual decision from the explanation, and to denoise these rewards with a rectified flow model that is augmented with cross-attention and embedded in the LLM. Empirical results on RL (SMAC) and LLM benchmarks (MMLU, MathQA) show that the method outperforms supervised fine-tuning and RLHF baselines by notable margins and generalizes robustly to negative samples; ablations confirm that both the flow and cross-attention components are necessary. Overall, the approach reduces dependence on expensive human feedback and offers a scalable path to high-quality, trustworthy explanations that improve the interpretability and reliability of decisions in diverse settings.
Abstract
As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain these agents' policies in natural language will be vital for reliable coexistence. In this paper, we build a model-agnostic explanation generator based on an LLM. The technical novelty is that the rewards for training this LLM are generated by a generative flow-matching model. This model has a specially designed structure in which a hidden layer is merged with an LLM, so that the linguistic cues in explanations inform the generated rewards. Experiments on both RL and LLM tasks demonstrate that our method can generate dense and effective rewards while reducing the need for expensive human feedback; it thus enables effective explanations and even improves decision accuracy on the original tasks.
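To make the flow-matching idea concrete, the following is a minimal sketch of a generic rectified flow in plain Python. It is not the authors' architecture: the conditioning on LLM hidden states and the cross-attention layers are omitted, the reward values are hypothetical, and an oracle velocity field stands in for a trained network. It only illustrates the straight-line interpolation between noise and a target, the velocity regression target, and Euler sampling.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to target x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Rectified flow regresses onto the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    v_star = target_velocity(x0, x1)
    return sum((p - s) ** 2 for p, s in zip(v_pred, v_star)) / len(v_star)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical per-sentence rewards for a three-sentence explanation.
rewards = [0.2, 0.9, 0.5]

# Gaussian noise is the starting point of the flow.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in rewards]


def oracle(x, t):
    """Assumed stand-in for a trained network: the exact target velocity."""
    return target_velocity(noise, rewards)


# With the exact velocity field, sampling transports the noise to the rewards.
sampled = euler_sample(oracle, noise, steps=10)
```

In a trained system, `oracle` would be replaced by a learned network evaluated at `interpolate(noise, rewards, t)` for random `t` and fitted by minimizing `flow_matching_loss`; in the paper's setting that network would additionally be conditioned on the explanation's linguistic features.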
