
Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Md Muntaqim Meherab, Noor Islam S. Mohammad, Faiza Feroz

TL;DR

This work proposes Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts, and combines task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery.

Abstract

Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds ($n{=}15$ paired runs), CCG achieves $\mathrm{CFS}=5.654\pm0.625$, outperforming ROME-style tracing ($3.382\pm0.233$), SAE-only ranking ($2.479\pm0.196$), and a random baseline ($1.032\pm0.034$), with $p<0.0001$ after Bonferroni correction. Learned graphs are sparse (5-6% edge density), domain-specific, and stable across seeds.
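The abstract does not spell out the CFS formula. One reading consistent with the random baseline scoring close to 1 is a ratio of the mean downstream effect of graph-guided interventions to that of random interventions. The sketch below illustrates that reading only; the function name `causal_fidelity_score`, the use of a plain mean, and the effect values are assumptions for illustration, not the authors' definition or data.

```python
from statistics import mean


def causal_fidelity_score(guided_effects, random_effects):
    """Assumed CFS: mean effect of graph-guided interventions divided by
    the mean effect of random interventions (hypothetical definition)."""
    baseline = mean(random_effects)
    if baseline <= 0:
        raise ValueError("random-intervention effects must have a positive mean")
    return mean(guided_effects) / baseline


# Hypothetical downstream effect sizes (illustrative, not results from the paper).
guided = [0.42, 0.38, 0.45]
random_ = [0.10, 0.12, 0.08]

score = causal_fidelity_score(guided, random_)
print(f"CFS = {score:.3f}")
```

Under this reading, a score near 1 means the graph-guided choice of intervention targets does no better than chance, which matches the role of the random baseline in the reported results.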

Paper Structure

This paper contains 25 sections, 10 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: CCG pipeline. Stage 1: task-conditioned SAE on GPT-2 Medium residual activations (Layer 12) with TopK gating ($K{=}256$, $k{=}13$; 5.1% L0). Stage 2: DAGMA learns a sparse DAG over the top-64 concepts per domain. Stage 3: CFS evaluates intervention faithfulness ($\mathrm{CFS}{=}5.654$; $p{<}0.0001$ vs. baselines).
  • Figure 2: Dataset prompt lengths. Word-count histograms for ARC-Challenge (left; mean 22.6), StrategyQA (middle; mean 9.6), and LogiQA (right; near-zero due to separate context fields). We train SAEs and CCGs per dataset.
  • Figure 3: SAE training curves. Reconstruction MSE decreases. L1 sparsity and $\beta$-loss increase (centre-left). L0 activation rate converges to 5.1% with TopK=13 (centre-right), avoiding the broken 92% regime.
  • Figure 4: Learned CCG topologies. Top-20 nodes (degree centrality) and top-30 edges (weight) for ARC (left; 226 edges, 5.5%), StrategyQA (middle; 260 edges, 6.3%; hubs C18/C40/C22), and LogiQA (right; 234 edges, 5.7%; chain-like). Labels denote SAE concept indices.
  • Figure 5: Main results. Mean CFS $\pm$ 1 std over five seeds for each method and dataset. The dashed line marks random chance (CFS$=1.0$). CCG consistently outperforms ROME, SAE-only, and Random; values are in Table \ref{tab:main}.
  • ...and 3 more figures
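
The sparsity figures quoted in the captions above can be checked with a little arithmetic. The sketch below shows a generic TopK gate (keep the $k$ largest of $K$ activations, zero the rest) and recomputes the 5.1% L0 rate and the per-dataset edge densities. The helper names are hypothetical. The $64 \times 64$ denominator for edge density is inferred because it reproduces the reported percentages; it is not stated in the captions.

```python
def topk_gate(activations, k):
    """Generic TopK gating: keep the k largest activations, zero the rest.
    Ties are broken in favour of the lower index."""
    if not 0 < k <= len(activations):
        raise ValueError("k must be between 1 and len(activations)")
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


def l0_rate(k, dict_size):
    """Fraction of dictionary features left active by a TopK gate."""
    return k / dict_size


def edge_density(num_edges, num_nodes):
    """Edges as a fraction of all num_nodes x num_nodes adjacency entries
    (assumed denominator; it matches the percentages in Figure 4)."""
    return num_edges / (num_nodes * num_nodes)


# K = 256 dictionary features with k = 13 kept, as in Figure 1.
acts = [((i * 37) % 256) + 1.0 for i in range(256)]  # hypothetical activations
gated = topk_gate(acts, 13)
active = sum(1 for a in gated if a != 0.0)
print(f"active features: {active}, L0 rate: {100 * l0_rate(13, 256):.1f}%")

# Edge counts from Figure 4, over the top-64 concepts per domain.
for name, edges in [("ARC", 226), ("StrategyQA", 260), ("LogiQA", 234)]:
    print(f"{name}: {100 * edge_density(edges, 64):.1f}%")
```

With $k=13$ and $K=256$ the gate leaves $13/256 \approx 5.1\%$ of features active, and the three edge counts give 5.5%, 6.3%, and 5.7% respectively, in line with the captions.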