Nuance Matters: Probing Epistemic Consistency in Causal Reasoning

Shaobo Cui, Junyou Li, Luca Mouchel, Yiyang Feng, Boi Faltings

TL;DR

This paper introduces causal epistemic consistency: the self-consistency of Large Language Models (LLMs) in differentiating intermediates with nuanced differences in causal reasoning.

Abstract

Our study introduces the concept of causal epistemic consistency, which focuses on the self-consistency of Large Language Models (LLMs) in differentiating intermediates with nuanced differences in causal reasoning. We propose a suite of novel metrics -- intensity ranking concordance, cross-group position agreement, and intra-group clustering -- to evaluate LLMs on this front. Through extensive empirical studies on 21 high-profile LLMs, including GPT-4, Claude3, and LLaMA3-70B, we find evidence that current models struggle to maintain epistemic consistency in identifying the polarity and intensity of intermediates in causal reasoning. Additionally, we explore the potential of using internal token probabilities as an auxiliary tool to maintain causal epistemic consistency. In summary, our study bridges a critical gap in AI research by investigating self-consistency over the fine-grained intermediates involved in causal reasoning.
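
The excerpt here does not give the formal definitions of the three metrics. As a rough illustration of what a ranking-concordance check could look like, the sketch below compares the order in which intermediates were generated with the order the same model later assigns them, using Kendall's tau. The choice of Kendall's tau, the function name `kendall_tau`, and the labels `d3` ... `s3` (defeaters and supporters with intensities 1 to 3) are assumptions made for illustration, not the paper's definitions.

```python
from itertools import combinations


def kendall_tau(generated_order, ranked_order):
    """Kendall's tau between two orderings of the same items (no ties).

    Hypothetical helper: returns 1.0 when the ranking phase reproduces the
    generation order exactly and -1.0 when it reverses it.
    """
    # Position of every item in the ranking-phase order.
    pos = {item: idx for idx, item in enumerate(ranked_order)}
    concordant = discordant = 0
    # In each pair (a, b), a precedes b in the generation order.
    for a, b in combinations(generated_order, 2):
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 1.0


# Strongest defeater first, strongest supporter last (assumed labelling).
generated = ["d3", "d2", "d1", "s1", "s2", "s3"]
# The ranking phase swaps two neighbouring defeaters.
ranked = ["d3", "d1", "d2", "s1", "s2", "s3"]
tau = kendall_tau(generated, ranked)
```

With six items there are 15 pairs; one swapped pair leaves 14 concordant and 1 discordant, so `tau` is 13/15.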

Paper Structure (32 sections, 15 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Overview of the evaluation framework for causal epistemic consistency. The first step involves instructing LLMs to generate fine-grained intermediates that influence a given causal relationship differently. The second step requires LLMs to rank their own generations based on their causal nuance. Finally, the proposed metrics are used to assess the self-consistency between ranking and generation, i.e., the LLMs' causal epistemic consistency.
  • Figure 2: Illustration of the proposed metrics from three aspects: intensity (intensity ranking concordance), polarity (cross-group position agreement), and clustering (intra-group clustering). These metrics measure the self-consistency of LLMs in generating and ranking supporting and defeating intermediates with varying intensities. The numbers indicate the intensity of the generated intermediates, with the lowest value being the strongest generated defeater and the highest value the strongest generated supporter.
  • Figure 3: Radar charts comparing the performance of various LLM architectures and sizes (Gemma, LLaMA2, Phi-3, and LLaMA3) in maintaining causal epistemic consistency. Each color of the radar plot lines represents a different model size.
  • Figure 4: Visualization of how LLaMA3-70B (left) and GPT-4o (right) align the predicted ranking of intermediates with their generation-phase ranking, indicating the models' self-consistency in intensity, polarity, and clustering. Each matrix element $(i,j)$ indicates the percentage of instances where an intermediate ranked at position $i$ during the generation phase was ranked at position $j$ during the ranking phase. For example, the element in the row for a defeater with an intensity of 3 and the column for a supporter with an intensity of 4 gives the percentage of instances generated as a defeater with an intensity of 3 that were ranked as a supporter with an intensity of 4.
  • Figure 5: Impact of various conjunction words on the causal epistemic consistency across different LLMs. The x-axes categorize conjunction words into coordinating conjunctions, subordinate conjunctions, and conjunctive adverbs. The y-axes display values for causal epistemic consistency metrics. The analysis encompasses diverse model types (distinguished by marker color and shape) at different scales (represented by line thickness and marker size).
  • ...and 5 more figures
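
The matrix described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `position_matrix` and the input encoding (each instance is a list whose slot `j` holds the 0-based generation position of the intermediate placed at ranking position `j`) are hypothetical.

```python
def position_matrix(instances, n_positions):
    """Percentage matrix M where M[i][j] is the share of instances in which
    the intermediate at generation position i was placed at ranking
    position j.

    Assumed encoding: each instance is a list of length n_positions, and
    instance[j] is the generation position of the item ranked at slot j.
    """
    counts = [[0] * n_positions for _ in range(n_positions)]
    for ranked in instances:
        for j, i in enumerate(ranked):
            counts[i][j] += 1
    total = len(instances)
    if total == 0:
        # No data: return an all-zero matrix rather than dividing by zero.
        return [[0.0] * n_positions for _ in range(n_positions)]
    return [[100.0 * c / total for c in row] for row in counts]


# Two toy instances with three intermediates each.
# The first is ranked exactly as generated; the second swaps positions 0 and 1.
toy_instances = [[0, 1, 2], [1, 0, 2]]
matrix = position_matrix(toy_instances, 3)
```

A perfectly self-consistent model would put 100 on the diagonal and 0 elsewhere. In the toy data, position 2 is always ranked where it was generated, while positions 0 and 1 are confused half the time.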

Theorems & Definitions (1)

  • Definition 1: Causal epistemic consistency