Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Md Muntaqim Meherab; Noor Islam S. Mohammad; Faiza Feroz

TL;DR

This work proposes Causal Concept Graphs (CCG): directed acyclic graphs over sparse, interpretable latent features whose edges capture learned causal dependencies between concepts. The method combines task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery.

Abstract

Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds ($n{=}15$ paired runs), CCG achieves $\mathrm{CFS}=5.654\pm0.625$, outperforming ROME-style tracing ($3.382\pm0.233$), SAE-only ranking ($2.479\pm0.196$), and a random baseline ($1.032\pm0.034$), with $p<0.0001$ after Bonferroni correction. Learned graphs are sparse (5-6\% edge density), domain-specific, and stable across seeds.
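
As an illustration of the "DAGMA-style differentiable structure learning" mentioned above, the following minimal sketch evaluates the log-determinant acyclicity function used by DAGMA, $h(W) = -\log\det(sI - W\circ W) + d\log s$, which is zero for a DAG and positive when a cycle is present. This is not the authors' implementation: the function name `dagma_logdet_h`, the choice $s=1$, and the toy 4-node weight matrices are assumptions made for illustration only.

```python
import numpy as np

def dagma_logdet_h(W, s=1.0):
    """Log-det acyclicity value h(W) = -log det(sI - W*W) + d log s.

    W*W is the elementwise (Hadamard) square of the weighted adjacency matrix.
    """
    d = W.shape[0]
    M = s * np.eye(d) - W * W
    sign, logabsdet = np.linalg.slogdet(M)
    # A positive determinant is a necessary (not sufficient) condition for
    # W to lie in the domain where h is defined; this is only a basic guard.
    if sign <= 0:
        raise ValueError("W is outside the domain where h is defined")
    return -logabsdet + d * np.log(s)

# Toy example (hypothetical weights): strictly upper-triangular => a DAG.
W_dag = np.triu(np.full((4, 4), 0.5), k=1)

# Adding the back edge 3 -> 0 closes a cycle.
W_cyc = W_dag.copy()
W_cyc[3, 0] = 0.5

h_dag = dagma_logdet_h(W_dag)  # zero up to floating-point error
h_cyc = dagma_logdet_h(W_cyc)  # strictly positive
print(h_dag, h_cyc)
```

In a structure-learning loop, a term of this form is typically added as a penalty to a data-fit loss so that gradient descent is pushed toward acyclic weight matrices; how the paper weights or schedules that penalty is not stated in this excerpt.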

Paper Structure (25 sections, 10 equations, 8 figures, 4 tables)

Introduction
Contributions.
Related Work
Mechanistic interpretability.
Methodology
Stage 1: Task-Conditioned Sparse Autoencoder
Neuron resampling.
Task conditioning.
Stage 2: Causal Concept Graph Learning
Stage 3: Causal Fidelity Score
Experimental Setup
SAE Training and Concept Quality
CCG Training and Graph Structure
Main Results: Causal Fidelity Score
Statistical Significance
...and 10 more sections

Figures (8)

Figure 1: CCG pipeline. Stage 1: task-conditioned SAE on GPT-2 Medium residual activations (Layer 12) with TopK gating ($K{=}256$, $k{=}13$; 5.1% L0). Stage 2: DAGMA learns a sparse DAG over the top-64 concepts per domain. Stage 3: CFS evaluates intervention faithfulness ($\mathrm{CFS}{=}5.654$; $p{<}0.0001$ vs. baselines).
Figure 2: Dataset prompt lengths. Word-count histograms for ARC-Challenge (left; mean 22.6), StrategyQA (middle; mean 9.6), and LogiQA (right; near-zero due to separate context fields). We train SAEs and CCGs per dataset.
Figure 3: SAE training curves. Reconstruction MSE decreases. L1 sparsity and $\beta$-loss increase (centre-left). L0 activation rate converges to 5.1% with TopK=13 (centre-right), avoiding the broken 92% regime.
Figure 4: Learned CCG topologies. Top-20 nodes (degree centrality) and top-30 edges (weight) for ARC (left; 226 edges, 5.5%), StrategyQA (middle; 260 edges, 6.3%; hubs C18/C40/C22), and LogiQA (right; 234 edges, 5.7%; chain-like). Labels denote SAE concept indices.
Figure 5: Main results. Mean CFS $\pm$ 1 std over five seeds for each method and dataset. The dashed line marks random chance (CFS$=1.0$). CCG consistently outperforms ROME, SAE-only, and Random; values are in Table \ref{tab:main}.
...and 3 more figures
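
The numbers quoted in the captions of Figures 1 and 4 can be checked with a short sketch. It applies a generic TopK gate with $K{=}256$ latents and $k{=}13$ active units to confirm the 5.1% L0 rate, then recomputes edge density for the reported edge counts over 64 concepts. The helper names `topk_gate` and `edge_density`, the random pre-activations, and the density denominator $d^2$ (all ordered pairs, including self-pairs) are assumptions; the $d^2$ choice is used only because it reproduces the percentages in the captions, and the paper's own definition is not given in this excerpt.

```python
import numpy as np

def topk_gate(pre_acts, k):
    """Keep the k largest entries in each row and zero out the rest."""
    idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre_acts)
    kept = np.take_along_axis(pre_acts, idx, axis=-1)
    np.put_along_axis(out, idx, kept, axis=-1)
    return out

def edge_density(num_edges, num_nodes=64):
    """Fraction of the d*d possible ordered pairs that carry an edge (assumed definition)."""
    return num_edges / (num_nodes * num_nodes)

# Hypothetical batch of SAE pre-activations: 8 examples, K = 256 latents.
rng = np.random.default_rng(0)
pre_acts = rng.normal(size=(8, 256))
codes = topk_gate(pre_acts, k=13)

# Average fraction of active latents per example: 13 / 256, about 5.1%.
l0_rate = np.count_nonzero(codes, axis=-1).mean() / codes.shape[-1]

# Edge counts as reported in the Figure 4 caption.
edge_counts = {"ARC": 226, "StrategyQA": 260, "LogiQA": 234}
densities = {name: edge_density(e) for name, e in edge_counts.items()}

print(f"L0 rate: {100 * l0_rate:.1f}%")
for name, dens in densities.items():
    print(f"{name}: {100 * dens:.1f}%")
```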
