The Logical Implication Steering Method for Conditional Interventions on Transformer Generation

Damjan Kalajdzievski

TL;DR

The paper tackles programmable, interpretable control of transformer generation by embedding neuro-symbolic logic through concept vectors. It formalizes a conditional circuit, $P\rightarrow Q$, that uses sensing vectors $p$ and steering vectors $q$ to modify internal activations with minimal fine-tuning, and it includes a mergeable variant, m-LIMS, for easy deployment. Empirically, LIMS demonstrates data-efficient behavior steering across hallucination, unanswerable-question, safety, and chain-of-thought tasks, often rivaling or surpassing 10-shot prompts while using far fewer training examples and less memory than reinforcement-learning-based alternatives. The approach offers a scalable, transparent framework for modular reasoning in LLMs, with potential for multi-task, cross-domain, and multi-modal extensions, while leaving room to explore layer selection, richer logic, and more complex neuro-symbolic compositions.

Abstract

The field of mechanistic interpretability in pre-trained transformer models has demonstrated substantial evidence supporting the "linear representation hypothesis", which is the idea that high-level concepts are encoded as vectors in the space of activations of a model. Studies also show that model generation behavior can be steered toward a given concept by adding the concept's vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand-engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.
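As a rough illustration of the conditional idea described above (sense a concept $P$ in the activations, then steer toward a behavior $Q$), the following minimal NumPy sketch may help. It is not the paper's exact formulation: the function name `lims_step`, the hard dot-product threshold, and the toy three-dimensional vectors are all assumptions made for illustration only.

```python
import numpy as np

def lims_step(h, p, q, threshold=0.0):
    """Hypothetical conditional steering step (illustrative only).

    h: activation vector at some layer
    p: sensing vector for concept P (assumed to be given)
    q: steering vector for behavior Q (assumed to be given)
    threshold: assumed decision boundary on the projection of h onto p
    """
    # Sense: does the activation express concept P?
    fires = float(np.dot(p, h) > threshold)
    # Steer: add q only when P is detected, otherwise leave h untouched.
    return h + fires * q

# Toy vectors, chosen by hand for the example.
p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 0.0, 2.0])
h_with_p = np.array([0.9, 0.3, 0.0])      # projects positively onto p
h_without_p = np.array([-0.5, 0.3, 0.0])  # projects negatively onto p

steered = lims_step(h_with_p, p, q)
unchanged = lims_step(h_without_p, p, q)
```

In this sketch the steering vector is added only when the sensing projection exceeds the threshold, so activations that do not express $P$ pass through unmodified. How the vectors are obtained and where in the network the intervention is applied are left to the paper itself.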
