Causal Interventions on Causal Paths: Mapping GPT-2's Reasoning From Syntax to Semantics

Isabelle Lee, Joshua Lum, Ziyi Liu, Dani Yogatama

TL;DR

In GPT-2 small, causal syntax is localized within the first 2-3 layers, while certain heads in later layers are more sensitive to nonsensical variations of causal sentences. This suggests that models may infer reasoning by (1) detecting syntactic cues and (2) isolating distinct heads in the final layers that focus on semantic relationships.

Abstract

While interpretability research has shed light on some internal algorithms utilized by transformer-based LLMs, reasoning in natural language, with its deep contextuality and ambiguity, defies easy categorization. As a result, formulating clear and motivating questions for circuit analysis that rely on well-defined in-domain and out-of-domain examples required for causal interventions is challenging. Although significant work has investigated circuits for specific tasks, such as indirect object identification (IOI), deciphering natural language reasoning through circuits remains difficult due to its inherent complexity. In this work, we take initial steps to characterize causal reasoning in LLMs by analyzing clear-cut cause-and-effect sentences like "I opened an umbrella because it started raining," where causal interventions may be possible through carefully crafted scenarios using GPT-2 small. Our findings indicate that causal syntax is localized within the first 2-3 layers, while certain heads in later layers exhibit heightened sensitivity to nonsensical variations of causal sentences. This suggests that models may infer reasoning by (1) detecting syntactic cues and (2) isolating distinct heads in the final layers that focus on semantic relationships.

Paper Structure

This paper contains 17 sections, 2 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Overview of methods.
  • Figure 2: Proportion of Effect-to-Cause or Cause-to-Effect Attention.
  • Figure 3: Attention head out activation patching results over all positions. O = Object, L = Location, S = So, B = Because.
  • Figure 4: Proportion of Attention Paid to Causal Delimiters.
  • Figure 5: Logit analysis with per layer loss.
  • ...and 5 more figures