Skill Path: Unveiling Language Skills from Circuit Graphs

Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang

TL;DR

The paper addresses two weaknesses of circuit graphs in the mechanistic interpretability of language models: they conflate multiple skills, and they obscure the causal dependencies between components. It introduces Skill Paths, obtained through a three-step pipeline of Decomposition, Pruning, and Post-pruning Causal Mediation. The decomposition step gives a lossless linear decomposition of the transformer into 29 functional components, so that $LM_l(X)=\sum_{i=0}^{28} C^i$. From this decomposition the method builds a Computation Graph $\mathcal{G}$, derives circuit graphs $\mathcal{G}^*$ by pruning, and then applies counterfactual interventions to obtain a Skill Graph $\mathcal{G}^{S}$ that isolates the target skill. Experiments on three language skills (Previous Token, Induction, and In-Context Learning) support two conjectures, Stratification and Inclusiveness: simpler skills reside in shallower layers, and more complex skills build on simpler ones. The evidence comes from a quantitative analysis of path receivers and cross-skill overlaps. The results offer a causally grounded, modular view of language skills, with implications for interpretability, the analysis of skill evolution, and possibly the study of training dynamics.
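To make the additive form $LM_l(X)=\sum_{i=0}^{28} C^i$ concrete, the following is a minimal sketch on a toy residual layer, not the paper's 29-component decomposition. It assumes every component is a plain linear map that writes into the residual stream; the names `head_maps`, `components`, and the sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 8, 3  # toy sizes, not the paper's configuration

# Residual-stream input for a single token.
x = rng.normal(size=d_model)
# Hypothetical linear maps, one per component (assumption: purely linear).
head_maps = rng.normal(size=(n_heads, d_model, d_model))

# Each component writes additively into the residual stream.
# C^0 is the skip connection; C^1..C^n are the component outputs.
components = [x] + [x @ W for W in head_maps]

# The usual residual update ...
layer_out = x + sum(x @ W for W in head_maps)
# ... equals the sum of the individual components.
decomposed = np.sum(components, axis=0)
assert np.allclose(layer_out, decomposed)

# Because the form is additive, ablating one component changes the
# output by exactly that component's contribution.
ablated = decomposed - components[1]
assert np.allclose(layer_out - ablated, components[1])
```

In this toy setting the decomposition is lossless by construction. How the paper handles the nonlinear parts of a real transformer is not shown here.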

Abstract

Circuit graph discovery has emerged as a fundamental approach to elucidating the skill mechanisms of language models. Despite the output faithfulness of circuit graphs, they suffer from atomic ablation, which causes the loss of causal dependencies between connected components. In addition, their discovery process, designed to preserve output faithfulness, inadvertently captures extraneous effects other than an isolated target skill. To alleviate these challenges, we introduce skill paths, which offer a more refined and compact representation by isolating individual skills within a linear chain of components. To enable skill path extraction from circuit graphs, we propose a three-step framework consisting of decomposition, pruning, and post-pruning causal mediation. In particular, we offer a complete linear decomposition of the transformer model, which leads to a disentangled computation graph. After pruning, we further adopt causal analysis techniques, including counterfactuals and interventions, to extract the final skill paths from the circuit graph. To underscore the significance of skill paths, we investigate three generic language skills (Previous Token Skill, Induction Skill, and In-Context Learning Skill) using our framework. Experiments support two crucial properties of these skills, namely stratification and inclusiveness.
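The pruning and causal-mediation steps can be illustrated on a toy graph. The sketch below is not the authors' algorithm: it assumes a small linear graph with scalar edge weights, uses greedy edge ablation as a stand-in for pruning, and uses a simple skill-versus-background input contrast as a stand-in for the counterfactual intervention. All names (`EDGES`, `prune`, `path_effect`) and values are hypothetical.

```python
# Hypothetical toy graph: scalar weights on directed edges,
# with nodes listed in topological order.
EDGES = {
    ("in", "a"): 1.0,
    ("in", "b"): 0.5,
    ("a", "out"): 2.0,
    ("b", "out"): 0.01,
    ("a", "b"): 0.0,
}
ORDER = ["in", "a", "b", "out"]

def forward(edges, x):
    """Propagate a scalar input through the linear toy graph."""
    value = {node: 0.0 for node in ORDER}
    value["in"] = x
    for node in ORDER[1:]:
        value[node] = sum(w * value[u] for (u, v), w in edges.items() if v == node)
    return value["out"]

def prune(edges, x, tol):
    """Greedily drop edges whose removal keeps the output within tol of the original."""
    base = forward(edges, x)
    kept = dict(edges)
    for edge in list(edges):
        trial = {e: w for e, w in kept.items() if e != edge}
        if abs(forward(trial, x) - base) <= tol:
            kept = trial
    return kept

def enumerate_paths(edges, src="in", dst="out"):
    """List every src-to-dst chain of nodes in an acyclic edge set."""
    if src == dst:
        return [[dst]]
    found = []
    for (u, v) in edges:
        if u == src:
            found += [[src] + rest for rest in enumerate_paths(edges, v, dst)]
    return found

def path_effect(edges, path, x_skill, x_background):
    """Contrast the output of a single path on a skill input vs. a background input."""
    path_edges = {(u, v): edges[(u, v)] for u, v in zip(path, path[1:])}
    return forward(path_edges, x_skill) - forward(path_edges, x_background)

circuit = prune(EDGES, x=1.0, tol=0.01)   # toy analogue of a circuit graph
chains = enumerate_paths(circuit)          # candidate linear chains
effects = {tuple(p): path_effect(circuit, p, 1.0, 0.0) for p in chains}
```

Treating each surviving chain as a unit, rather than ablating single nodes, keeps the dependency between connected components intact. This is the motivation the abstract gives for moving from circuit graphs to skill paths.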

Paper Structure (41 sections, 8 equations, 14 figures, 10 tables)


Figures (14)

  • Figure 1: The difference and correlation between a skill path and a circuit graph. We use the induction dataset as an example and present two types of induction samples. When the induction dataset contains a certain number of samples that also involve arithmetic skills, the final circuit graph may contain parts of the paths for arithmetic skills in addition to the induction skill. The same can happen with samples that involve multi-choice skills or other potential skills. We verify in the Appendix that even datasets randomly sampled across domains are still confounded by other skills.
  • Figure 2: A case text about causal effects.
  • Figure 3: t-SNE visualization of 6 types of samples on vocabulary candidates. Red denotes the output of the original model ($\mathcal{G}$), while blue denotes the output once the corresponding skill path is removed ($\mathcal{G}-\mathcal{G}^{S}$). The outputs for the background text ($\mathcal{G}_{\text{Bkg}}$) and the self text ($\mathcal{G}_{\text{Self}}$) are shown in green and yellow, respectively.
  • Figure 4: Visualization of receivers distributed in layers 1-10 for 3 increasingly complex skills (PVT, IDT, and ICL1).
  • Figure 5: Bisection clustering on paths with the top 10% $Eff_{Skill}$ for the 3 skills.
  • ...and 9 more figures