A Hypothesis-Driven Framework for the Analysis of Self-Rationalising Models

Marc Braun, Jenny Kunz

TL;DR

This work examines how faithful the self-rationalising explanations of large language models (LLMs) are by proposing a hypothesis-driven surrogate framework. Starting from a hypothetical global explanation (HGE) of how a task such as natural language inference is solved, it builds a statistical surrogate model (SSM) in the form of a Bayesian network, derives natural language explanations (NLEs) from the SSM, and compares them with GPT-3.5 explanations using both human and automatic evaluations. The study finds only modest alignment between the SSMs and GPT-3.5. The smaller SSM, which has a stronger inductive bias, performs better than the larger one, which suggests that the hypotheses and surrogate designs need refinement. Overall, the framework offers a transparent way to test hypotheses about LLM reasoning and points to concrete directions for improving faithfulness and surrogate-model construction in future work.

Abstract

The self-rationalising capabilities of LLMs are appealing because the generated explanations can give insights into the plausibility of the predictions. However, it is questionable how faithful the explanations are to the predictions, which raises the need to explore the patterns behind them further. To this end, we propose a hypothesis-driven statistical framework. We use a Bayesian network to implement a hypothesis about how a task (in our example, natural language inference) is solved, and translate its internal states into natural language with templates. These explanations are then compared to LLM-generated free-text explanations using automatic and human evaluations. This allows us to judge how similar the LLM's and the Bayesian network's decision processes are. We demonstrate the usage of our framework with an example hypothesis and two realisations in Bayesian networks. The resulting models do not exhibit a strong similarity to GPT-3.5. We discuss the implications of this as well as the framework's potential to approximate LLM decisions better in future work.
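
The pipeline described in the abstract can be pictured with a small toy sketch. It is not the authors' SSM: the subphrase pairs, the probabilities, the aggregation rule, and the templates below are all hypothetical and chosen only to show the general shape, in which hidden variables over subphrase pairs feed a final label and the most relevant hidden state is verbalised with a template.

```python
from itertools import product

LABELS = ("entailment", "neutral", "contradiction")

# Hypothetical P(z | subphrase pair); the numbers are invented for illustration.
hidden = {
    ("a man", "a woman"): {"entailment": 0.05, "neutral": 0.15, "contradiction": 0.80},
    ("is sleeping", "is resting"): {"entailment": 0.70, "neutral": 0.25, "contradiction": 0.05},
}

def aggregate(states):
    """Assumed deterministic P(y | z): any contradiction wins, then any neutral."""
    if "contradiction" in states:
        return "contradiction"
    if "neutral" in states:
        return "neutral"
    return "entailment"

def predict(hidden):
    """Marginalise over every combination of hidden states to obtain P(y)."""
    pairs = list(hidden)
    p_y = {label: 0.0 for label in LABELS}
    for states in product(LABELS, repeat=len(pairs)):
        p = 1.0
        for pair, state in zip(pairs, states):
            p *= hidden[pair][state]  # hidden variables treated as independent
        p_y[aggregate(states)] += p
    return p_y

# Hypothetical templates that turn an internal state into a short explanation.
TEMPLATES = {
    "entailment": 'The premise says "{p}", which implies "{h}".',
    "neutral": 'The premise says "{p}", which neither implies nor rules out "{h}".',
    "contradiction": 'The premise says "{p}", which contradicts "{h}".',
}

def explain(hidden, label):
    """Verbalise the subphrase pair whose hidden variable most supports the label."""
    premise_part, hypothesis_part = max(hidden, key=lambda pair: hidden[pair][label])
    return TEMPLATES[label].format(p=premise_part, h=hypothesis_part)

p_y = predict(hidden)
label = max(p_y, key=p_y.get)
print(label, round(p_y[label], 3))
print(explain(hidden, label))
```

An explanation produced this way is faithful to the toy model by construction, which is what makes a comparison against free-text LLM explanations informative in a framework of this kind.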

Paper Structure

This paper contains 49 sections, 15 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An illustrative (simplified) example of the small SSM. The input $X$ consists of the subphrases of the premise and hypothesis. The circles are the hidden variables $Z$, followed by the final prediction $Y$ (here, contradiction) and a template-based NLE (bottom box).
  • Figure 2: Relationship of any $z_{k,l} \in \mathcal{Z}$ to its parents.
  • Figure 3: Structure of the $\text{SSM}_{large}$ expressed as a Bayesian network.