How Ambiguous Are the Rationales for Natural Language Reasoning? A Simple Approach to Handling Rationale Uncertainty

Hazel H. Kim

TL;DR

The paper tackles the problem of how ambiguity in rationales affects natural language reasoning in language models. It introduces an entropy-based framework to quantify rationale uncertainty from model priors and posteriors and proposes AURA, a two-system reasoning method that first learns from all data and then concentrates on high-entropy, ambiguous rationales. Empirical results across five datasets show that AURA yields robust improvements, particularly in out-of-distribution and low-resource scenarios, often outperforming prior pipeline rationalization methods. The work demonstrates that prioritizing robust reasoning mechanisms and explicitly handling rationale uncertainty can enhance practical NL reasoning when perfect rationales are unattainable.

Abstract

The quality of rationales is essential to the reasoning capabilities of language models. Rationales not only enhance reasoning performance in complex natural language tasks but also justify model decisions. However, obtaining impeccable rationales is often impossible. Our study investigates the role that ambiguous rationales play in model performance on natural language reasoning. We first assess the ambiguity of rationales through the lens of entropy and uncertainty in model prior beliefs, exploring its impact on task performance. We then propose a simple way to guide models to choose between two different reasoning paths depending on the ambiguity of the rationales. Our empirical results demonstrate that this approach leads to robust performance, particularly in adversarial scenarios where rationale quality is inconsistent.
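
To make the entropy idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each example comes with a predicted probability distribution over answer choices, scores ambiguity with Shannon entropy, and separates high-entropy examples with a fixed threshold. The names `entropy`, `split_by_ambiguity`, the `probs` field, and the threshold value are all hypothetical choices for illustration; the paper's own formulation over model priors and posteriors may differ.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def split_by_ambiguity(examples, threshold):
    """Partition examples into (confident, ambiguous) by predictive entropy.

    `threshold` is a hypothetical hyperparameter, not a value from the paper.
    """
    confident, ambiguous = [], []
    for ex in examples:
        if entropy(ex["probs"]) > threshold:
            ambiguous.append(ex)
        else:
            confident.append(ex)
    return confident, ambiguous


# Toy predictions over four answer choices (made-up numbers).
examples = [
    {"id": "q1", "probs": [0.97, 0.01, 0.01, 0.01]},
    {"id": "q2", "probs": [0.25, 0.25, 0.25, 0.25]},
    {"id": "q3", "probs": [0.40, 0.35, 0.15, 0.10]},
]

# Assumed cutoff: half of the maximum entropy for four choices.
threshold = 0.5 * math.log(4)
confident, ambiguous = split_by_ambiguity(examples, threshold)

print("confident:", [ex["id"] for ex in confident])
print("ambiguous:", [ex["id"] for ex in ambiguous])
```

Under these assumptions, a second training stage would revisit only the `ambiguous` subset, which loosely mirrors the idea of choosing a reasoning path based on how uncertain the rationale is.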

Paper Structure

This paper contains 22 sections, 6 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The huge performance mismatch between the training and validation sets across epochs.
  • Figure 3: An example of a reasoning task on generated rationales. The rationalizing LM generates rationales for the answer choices, and the reasoning LM predicts the answer from the given question, answer choices, and generated rationales.
  • Figure 4: Performance changes across different training ratios. Panels (a) to (d) show in-distribution settings; panels (e) to (f) show out-of-distribution settings.
  • Figure 5: Performance gains from AURA depending on the training and testing rationales. M: machine-generated rationales, H: human-written rationales. The legend shows the type of training rationales, with or without AURA. Training without AURA is standard training.