Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards

Xinyi Yang, Liang Zeng, Heng Dong, Chao Yu, Xiaoran Wu, Huazhong Yang, Yu Wang, Milind Tambe, Tonghan Wang

TL;DR

This work tackles the challenge of producing model-agnostic natural-language explanations for agent decisions by training an Explanation LLM with rewards generated by a rectified flow model. The key idea is to use per-sentence reward signals derived from how well a third-party auditor (Guidance LLM) infers the actual decision, while denoising these rewards via a cross-attention–augmented rectified flow model embedded in the LLM. Empirical results across RL (SMAC) and LLM benchmarks (MMLU, MathQA) show the method outperforming supervised fine-tuning and RLHF baselines by notable margins and robustly generalizing to negative samples; ablations confirm the necessity of the flow and cross-attention components. Overall, the approach reduces dependence on expensive human feedback and offers a scalable path to high-quality, trustworthy explanations that improve decision interpretability and reliability in diverse settings.

Abstract

As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain their policies in natural language will be vital for reliable coexistence. In this paper, we build a model-agnostic explanation generator based on an LLM. The technical novelty is that the rewards for training this LLM are generated by a generative flow matching model. This model has a specially designed structure with a hidden layer merged with an LLM to harness the linguistic cues of explanations into generating appropriate rewards. Experiments on both RL and LLM tasks demonstrate that our method can generate dense and effective rewards while saving on expensive human feedback; it thus enables effective explanations and even improves the accuracy of the decisions in original tasks.

Paper Structure

This paper contains 24 sections, 10 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: An overview of our method. (Left) We prompt an Explanation LLM to generate reasoning about an agent decision based on the context information. Our focus is on whether a third party can infer the actual decision from this explanation. (Middle) We employ a rectified flow model $\varphi$ to generate a probability distribution $\hat{p}$ over possible decisions, according to how likely they appear as a plausible outcome after each sentence of the explanation. Per-sentence rewards for training the Explanation LLM are the changes in the probability of the actual decision (highlighted in blue). (Right) The architecture and training of the rectified flow $\varphi$ are based on a Guidance LLM. The Guidance LLM provides positive samples, where, with the context and explanation as input, it can produce a distribution $p$ that assigns the highest probability to the actual decision. The rectified flow $\varphi$ is trained to produce such distributions $p$, with a cross-attention layer in its middle that selectively leverages information from the Guidance LLM input, enabling generalization to negative samples.
  • Figure 2: Accuracy of the rectified flow model $\varphi$ on unseen test samples, shown as the percentage of samples for which $\varphi$ reproduces the correct decision. Left: Accuracy on positive samples (where the Guidance LLM is correct). Right: Accuracy on negative samples.
  • Figure 3: Accuracy of the Explanation LLM increases through each training round.
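
The per-sentence reward described in the Figure 1 caption (the change in the probability of the actual decision after each sentence of the explanation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `per_sentence_rewards`, the `prior` argument (the probability assigned to the actual decision before any sentence is read), and the toy distributions are all assumptions made for illustration. In the paper the distributions $\hat{p}$ come from the rectified flow model $\varphi$; here they are hard-coded.

```python
from typing import List, Sequence


def per_sentence_rewards(
    decision_probs: Sequence[Sequence[float]],
    actual_decision: int,
    prior: float,
) -> List[float]:
    """Reward for sentence k = change in probability of the actual decision.

    decision_probs[k] is a (hypothetical) distribution over candidate
    decisions, estimated after reading sentences 0..k of the explanation.
    prior is the assumed probability of the actual decision before any
    sentence has been read (an assumption of this sketch).
    """
    rewards = []
    previous = prior
    for probs in decision_probs:
        current = probs[actual_decision]
        rewards.append(current - previous)  # positive if the sentence helped
        previous = current
    return rewards


# Toy example: three candidate decisions, a three-sentence explanation.
# These numbers are made up; they stand in for the outputs of the flow model.
probs_after_each_sentence = [
    [0.25, 0.50, 0.25],
    [0.125, 0.75, 0.125],
    [0.25, 0.625, 0.125],
]
rewards = per_sentence_rewards(
    probs_after_each_sentence, actual_decision=1, prior=0.25
)
print(rewards)
```

In this toy run the third sentence receives a negative reward because it lowers the probability of the actual decision. The rewards telescope, so their sum equals the final probability of the actual decision minus the assumed prior.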