Sparse Activation Editing for Reliable Instruction Following in Narratives

Runcong Zhao, Chengyu Cao, Qinglin Zhu, Xiucheng Lv, Shun Shao, Lin Gui, Ruifeng Xu, Yulan He

TL;DR

This work tackles the instability of instruction following in narrative contexts by introducing Concise-SAE, a training-free framework that localises instruction-relevant neurons and applies precise, sparse edits to improve adherence without labelled data. It combines two components: (i) localisation via keyword-anchored semantic aggregation that robustly identifies instruction-correlated neurons from noisy contrastive pairs, and (ii) steering via Bayesian-optimised, bidirectional edits in a compact latent subspace to boost compliance while preserving fluency. A new benchmark, FreeInstruct, with 1,212 narrative-rich examples, evaluates models under adversarial or ambiguous prompts and demonstrates that Concise-SAE achieves state-of-the-art instruction adherence across multiple models and tasks, including safety benchmarks. The results show that leveraging a small set of supportive and opposing neurons, guided by principled aggregation and optimisation, yields reliable, training-free control of generation with practical implications for safer, more reliable LLM deployments in real-world narrative applications.

Abstract

Complex narrative contexts often challenge language models' ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To thoroughly evaluate our method, we introduce FreeInstruct, a diverse and realistic benchmark of 1,212 examples that highlights the challenges of instruction following in narrative-rich settings. While initially motivated by complex narratives, Concise-SAE demonstrates state-of-the-art instruction adherence across varied tasks without compromising generation quality.
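
As a rough illustration of the bidirectional edit summarised in the TL;DR, the sketch below boosts a few "supportive" latents and damps a few "opposing" ones in a sparse code. The function name `steer_latent`, the additive update rule, and the fixed strengths `alpha` and `beta` are assumptions made for this example; the summary states only that edits are bidirectional and that strength is found by Bayesian optimisation, which is not reproduced here.

```python
def steer_latent(z, supportive, opposing, alpha, beta):
    """Return an edited copy of a sparse latent code (illustrative only).

    z          : list of non-negative latent activations, one per SAE neuron
    supportive : indices of neurons to boost (hypothetical selection)
    opposing   : indices of neurons to damp (hypothetical selection)
    alpha/beta : edit strengths, passed in by hand here (assumption);
                 the paper tunes strength automatically
    """
    edited = list(z)  # copy so the original code is left untouched
    for j in supportive:
        edited[j] = edited[j] + alpha
    for j in opposing:
        # clamp at zero so the code stays non-negative
        edited[j] = max(0.0, edited[j] - beta)
    return edited


z = [0.0, 1.0, 0.5, 2.0]
print(steer_latent(z, supportive=[1], opposing=[3], alpha=0.5, beta=1.5))
# prints [0.0, 1.5, 0.5, 0.5]
```

In a full pipeline the edited code would be decoded back into the residual stream by the SAE decoder; that step depends on the specific SAE and is left out of this sketch.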

Paper Structure

This paper contains 33 sections, 13 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Examples of user inputs that deviate from intended instructions, challenging LLM agents' reliability and alignment.
  • Figure 2: Contrastive neuron identification. Given an instruction, we prompt the LLM to generate a pair of stories: one that follows the instruction and one that violates it. A keyword token (e.g., “realistic”) summarising the instruction is appended to each input, and its residual representation $\mathbf{h}_{\star}$ is extracted from a target LLM layer. These are encoded via an SAE to obtain sparse vectors $\mathbf{z}_{\star}$, which are used to rank neurons based on how consistently they differentiate between positive and negative examples, using the metric defined in Equation \ref{eq:difference}.
  • Figure 3: Overview of the FreeInstruct data construction process. The boxed components represent the final structure of each FreeInstruct example: (story, normal input, adversarial input, expected output).
  • Figure 4: Pairwise cosine similarity between neurons selected for steering. The supportive and opposing groups are internally coherent but mutually orthogonal, justifying the need to include both directions for effective control over generation.
  • Figure 5: Edit Direction and Strength. (a) Single-direction edits miss complementary control from opposing neurons in distinct subspaces. (b) Excessive strength degrades output; our method learns it automatically.
  • ...and 1 more figure
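
The captions for Figures 2 and 4 can be made concrete with a small sketch. It ranks neurons by the mean activation gap between instruction-following and instruction-violating sparse codes, then compares neuron directions by cosine similarity. The mean-difference score is a stand-in chosen for this example and is not the metric of the paper's Equation `eq:difference`, which this page does not reproduce; the names `rank_neurons` and `cosine` and the toy vectors are likewise hypothetical.

```python
import math


def rank_neurons(pos_codes, neg_codes, top_k):
    """Split neurons into supportive and opposing groups (illustrative only).

    pos_codes[i] / neg_codes[i] are the sparse codes of the i-th contrastive
    pair (following vs. violating the instruction). The score is a plain mean
    difference, an assumption standing in for the paper's metric.
    """
    n_pairs = len(pos_codes)
    n_neurons = len(pos_codes[0])
    scores = []
    for j in range(n_neurons):
        gap = sum(p[j] - n[j] for p, n in zip(pos_codes, neg_codes))
        scores.append(gap / n_pairs)
    order = sorted(range(n_neurons), key=lambda j: scores[j], reverse=True)
    # highest positive scores: more active when the instruction is followed
    supportive = [j for j in order[:top_k] if scores[j] > 0]
    # most negative scores: more active when the instruction is violated
    tail = order[n_neurons - top_k:]
    opposing = [j for j in reversed(tail) if scores[j] < 0]
    return supportive, opposing, scores


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data: neuron 0 fires on positives, neuron 1 on negatives, neuron 2 on both.
pos = [[1.0, 0.0, 0.5], [2.0, 0.0, 0.5]]
neg = [[0.0, 1.0, 0.5], [0.0, 3.0, 0.5]]
supportive, opposing, scores = rank_neurons(pos, neg, top_k=1)
print(supportive, opposing, scores)
```

On this toy data neuron 0 is selected as supportive and neuron 1 as opposing, while neuron 2 scores zero and is ignored. A similarity check in the spirit of Figure 4 would apply `cosine` to each selected neuron's direction vector (for example its SAE decoder row, which is an assumption about what the figure compares).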