A Hypothesis-Driven Framework for the Analysis of Self-Rationalising Models

Marc Braun; Jenny Kunz

TL;DR

This work examines how faithful the self-rationalising explanations of large language models are, using a hypothesis-driven surrogate framework. It builds a Bayesian-network surrogate model (SSM) from a hypothetical global explanation (HGE) of how a task such as natural language inference is solved, derives template-based natural language explanations (NLEs) from the SSM, and compares them to GPT-3.5 explanations using both automatic and human evaluations. The study finds only modest alignment between the SSMs and GPT-3.5; the smaller, more inductively biased SSM performs better than the larger one, which suggests that both the hypothesis and the surrogate design need refinement. The framework offers a transparent way to test hypotheses about LLM reasoning and points to concrete directions for improving faithfulness and surrogate-model construction in future work.

Abstract

The self-rationalising capabilities of LLMs are appealing because the generated explanations can give insights into the plausibility of the predictions. However, how faithful the explanations are to the predictions is questionable, raising the need to explore the patterns behind them further. To this end, we propose a hypothesis-driven statistical framework. We use a Bayesian network to implement a hypothesis about how a task (in our example, natural language inference) is solved, and its internal states are translated into natural language with templates. Those explanations are then compared to LLM-generated free-text explanations using automatic and human evaluations. This allows us to judge how similar the LLM's and the Bayesian network's decision processes are. We demonstrate the usage of our framework with an example hypothesis and two realisations in Bayesian networks. The resulting models do not exhibit a strong similarity to GPT-3.5. We discuss the implications of this as well as the framework's potential to approximate LLM decisions better in future work.
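
The last two steps of the pipeline described in the abstract, a deterministic prediction from hidden states followed by a template-based explanation, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the label set, the decision rule in `predict`, the template in `explain`, and the example subphrase pairs are hypothetical and are not the paper's contradiction and entailment conditions. The hidden states are also supplied as fixed values here, whereas a Bayesian network would infer them probabilistically.

```python
# Toy illustration only: the labels, the decision rule, and the template below
# are assumptions made for this sketch, not the paper's actual definitions.

def predict(z):
    """Map hidden states Z (one relation label per subphrase pair) to a label Y."""
    labels = list(z.values())
    if "contradiction" in labels:
        return "contradiction"
    if labels and all(label == "entailment" for label in labels):
        return "entailment"
    return "neutral"


def explain(z, y):
    """Verbalise the hidden states that agree with the prediction via a fixed template."""
    support = [pair for pair, label in z.items() if label == y]
    if not support:
        return f"The relation is {y}."
    reasons = " and ".join(f'"{p}" vs. "{h}"' for p, h in support)
    return f"The relation is {y} because of {reasons}."


# Hypothetical hidden states for premise/hypothesis subphrase pairs.
z = {
    ("a man", "a woman"): "contradiction",
    ("is sleeping", "is sleeping"): "entailment",
}
y = predict(z)
print(y)
print(explain(z, y))
```

Because the explanation is read directly off the states that produced the prediction, it is faithful to the surrogate by construction. That property is what makes a comparison against an LLM's free-text explanations informative.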

Paper Structure (49 sections, 15 equations, 3 figures, 3 tables)

Introduction
Related Work
Proposed Framework
Constructing the SSM
Extracting Subphrases from Premise and Hypothesis
Defining the Structure of the SSM
Defining the input variables $X$
Introducing hidden variables $Z$
Defining $Z$ for the large SSM
Defining $Z$ for the small SSM
Defining the output $Y$
Determining the Parameters of the SSM
Defining the Deterministic Distribution of Y|Z
Contradiction Condition
Entailment Condition
...and 34 more sections

Figures (3)

Figure 1: An illustrative (simplified) example for the small SSM. The input $X$ consists of the subphrases of the premise and hypothesis. The circles are the hidden variables $Z$, followed by the final prediction $Y$ (here, contradiction) and a template-based NLE (lowest box).
Figure 2: Relationship of any $z_{k,l} \in \mathcal{Z}$ to its parents
Figure 3: Structure of the $\text{SSM}_{large}$ expressed as a Bayesian Network
