The Logical Implication Steering Method for Conditional Interventions on Transformer Generation

Damjan Kalajdzievski

TL;DR

The paper tackles programmable, interpretable control of transformer generation by embedding neuro-symbolic logic through concept vectors. It formalizes a conditional circuit, $P\rightarrow Q$, using sensing vectors $p$ and steering vectors $q$ to modify internal activations with minimal fine-tuning, including a mergeable variant m-LIMS for easy deployment. Empirically, LIMS demonstrates data-efficient behavior steering across hallucination, unanswerable questions, safety, and chain-of-thought tasks, often rivaling or surpassing 10-shot prompts while using far fewer training examples and memory than reinforcement-learning-based alternatives. The approach offers a scalable, transparent framework for modular reasoning in LLMs, with potential for multi-task, cross-domain, and multi-modal extensions, while leaving room for exploring layer selection, richer logic, and more complex neuro-symbolic compositions.

Abstract

The field of mechanistic interpretability in pre-trained transformer models has demonstrated substantial evidence supporting the "linear representation hypothesis", which is the idea that high-level concepts are encoded as vectors in the space of activations of a model. Studies also show that model generation behavior can be steered toward a given concept by adding the concept's vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand-engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.
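To make the idea of a conditional intervention concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a hard on/off gate: a sensing vector $p$ and a threshold $b_p$ decide whether concept $P$ is present in an activation $h$, and only then is the steering vector $q$ added. The function names, the toy 2-d activations, and the hard gate are illustrative assumptions; the paper's exact circuit is not reproduced here.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def conditional_steer(h: Vector, p: Vector, q: Vector, b_p: float) -> Vector:
    """Return h + q if the sensing pre-activation p.h exceeds b_p, else h unchanged.

    Assumption: a hard on/off gate; the paper's exact circuit may differ.
    """
    if dot(p, h) > b_p:
        return [hi + qi for hi, qi in zip(h, q)]
    return list(h)


# Toy 2-d activations (hypothetical values, for illustration only).
p = [1.0, 0.0]   # sensing vector for concept P
q = [0.0, 3.0]   # steering vector for behavior Q
b_p = 1.0        # sensing threshold

steered = conditional_steer([2.0, 0.0], p, q, b_p)    # P present: q is added
untouched = conditional_steer([0.5, 0.0], p, q, b_p)  # P absent: no change
```

In this toy setup the first activation projects above the threshold and is shifted by $q$, while the second is returned unchanged, which mirrors the implication "if $P$ then $Q$".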

Paper Structure

This paper contains 29 sections, 2 theorems, 34 equations, 11 figures, 19 tables, 3 algorithms.

Key Result

Proposition 3.3

The probability of the approximate LIMS model behaving according to $Q$ on $S\subseteq D$ is where $\widehat{\text{LIMS}}$ is an approximate LIMS model with circuit added at the last input token position only.

Figures (11)

  • Figure 1: High-level overview of LIMS. Datasets contrasting the presence or absence of concepts $P$ and $Q$ are defined, and used to extract corresponding concept vectors $p$ and $q$. These vectors are used to construct a conditional implication circuit that activates behavior $Q$ when concept $P$ is present.
  • Figure 2: Performance on the SQuAD 2 task. Accuracy in rejecting questions with insufficient information (top), and overall performance (bottom). Existing models fail to detect when context is insufficient for factual answers. Notably, LIMS with 500 examples ($81.4\%$) outperforms DPO with 20,000 examples ($81.3\%$). GPT-4o performance is included to provide context on task difficulty.
  • Figure 3: Probability heatmaps of LIMS components. The top heatmap shows that sensing is concentrated on the last few tokens, as expected. The left heatmap depicts decoupled sensing and steering performance across tasks, where low base model probabilities on $P$ indicate that LIMS components control task performance. The right heatmap compares full LIMS test performance (first and fourth columns) to predicted values computed from 100 training examples (columns 2, 3, 5, and 6), with 95% confidence intervals. The close match indicates that these component-based predictions reliably estimate test performance. See the interpretability section for a full description of the model components and predictive equations.
  • Figure 4: Pre-activations of concept sensing at the last token. Tasks, from top left to bottom right: HaluEval, SQuAD 2, AdvBench, and GSM8K. The red line is the classification threshold $b_p$. The classes are well separated overall, although sensing hallucinations is more difficult than detecting math or toxicity.
  • Figure 5: Steering and sensing concept vector accuracy and norm by layer across tasks. For consistency, accuracy for GSM8K steering is normalized to accuracy of the COT prompt. We see that although norms of concept vectors are maximized near the last layer, the norms initially peak near the middle layers of the model. Generally, the middle layers show the best compromise between sensing and steering accuracy across all datasets.
  • ...and 6 more figures
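
Figure 1 describes extracting concept vectors from datasets that contrast the presence and absence of a concept, and Figure 4 shows a threshold $b_p$ separating the two classes of sensing pre-activations. The sketch below illustrates one common way to do both, under assumptions not confirmed by the source: the concept direction is taken as the difference of class means, and the threshold as the midpoint of the two projected means. The helper names and toy data are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def extract_concept(pos: List[Vector], neg: List[Vector]) -> Tuple[Vector, float]:
    """Difference-of-means direction and a midpoint threshold.

    Assumption: a generic recipe, not necessarily the paper's extraction procedure.
    """
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = [a - b for a, b in zip(mu_pos, mu_neg)]
    # Midpoint between the projections of the two class means.
    threshold = 0.5 * (dot(direction, mu_pos) + dot(direction, mu_neg))
    return direction, threshold


# Toy activations (hypothetical): concept present vs. absent.
with_concept = [[2.0, 1.0], [4.0, 1.0]]
without_concept = [[0.0, 1.0], [0.0, 1.0]]

p, b_p = extract_concept(with_concept, without_concept)
senses_present = dot(p, [3.0, 1.0]) > b_p
senses_absent = dot(p, [0.0, 1.0]) > b_p
```

On this toy data the direction picks out the first coordinate, which is the only one that differs between the two classes, and the threshold falls halfway between the projected class means.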

Theorems & Definitions (7)

  • Definition 3.1
  • Definition 3.2
  • Proposition 3.3
  • Definition 1.1
  • Definition 1.2
  • Proposition 1.3
  • proof