Sparse Activation Editing for Reliable Instruction Following in Narratives
Runcong Zhao, Chengyu Cao, Qinglin Zhu, Xiucheng Lv, Shun Shao, Lin Gui, Ruifeng Xu, Yulan He
TL;DR
This work addresses unstable instruction following in narrative contexts by introducing Concise-SAE, a training-free framework that localises instruction-relevant neurons and applies sparse, targeted edits to improve adherence without labelled data. It combines two components: (i) localisation via keyword-anchored semantic aggregation, which robustly identifies instruction-correlated neurons from noisy contrastive pairs, and (ii) steering via Bayesian-optimised, bidirectional edits in a compact latent subspace, which boosts compliance while preserving fluency. A new benchmark, FreeInstruct, comprising 1,212 narrative-rich examples, evaluates models under adversarial or ambiguous prompts. Experiments show that Concise-SAE achieves state-of-the-art instruction adherence across multiple models and tasks, including safety benchmarks. The results indicate that editing a small set of supportive and opposing neurons, guided by principled aggregation and optimisation, yields training-free control of generation, with practical implications for safer, more reliable LLM deployments in real-world narrative applications.
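To make the localise-then-steer idea concrete, the following is a minimal toy sketch, not the Concise-SAE implementation. It assumes that latent activations are available as plain lists of non-negative floats for a group of instruction-following samples and a group of non-following samples. It ranks units by a simple mean activation gap, which stands in for the keyword-anchored semantic aggregation described above, and it scales the selected units by fixed factors where the framework would tune edit strengths with Bayesian optimisation. The names `localise`, `steer`, `alpha`, and `beta` are hypothetical.

```python
from statistics import mean


def localise(pos_acts, neg_acts, k):
    """Pick up to k supportive and k opposing units (assumes k >= 1).

    A unit is 'supportive' if it is more active on instruction-following
    samples (pos_acts) and 'opposing' if it is more active on
    non-following samples (neg_acts). Assumption: a plain mean gap is
    used here in place of the paper's aggregation scheme.
    """
    n_units = len(pos_acts[0])
    gap = [
        mean(row[j] for row in pos_acts) - mean(row[j] for row in neg_acts)
        for j in range(n_units)
    ]
    order = sorted(range(n_units), key=lambda j: gap[j])  # ascending gap
    opposing = [j for j in order[:k] if gap[j] < 0]
    supportive = [j for j in reversed(order[-k:]) if gap[j] > 0]
    return supportive, opposing


def steer(latent, supportive, opposing, alpha, beta):
    """Bidirectional sparse edit: scale supportive units by alpha and
    opposing units by beta, leaving every other unit untouched."""
    edited = list(latent)
    for j in supportive:
        edited[j] *= alpha
    for j in opposing:
        edited[j] *= beta
    return edited


# Toy data: 2 samples per group, 4 latent units.
pos = [[0.9, 0.1, 0.0, 0.5], [0.7, 0.2, 0.1, 0.5]]
neg = [[0.1, 0.8, 0.0, 0.5], [0.3, 0.9, 0.1, 0.5]]

supportive, opposing = localise(pos, neg, k=1)
# alpha and beta are fixed by hand here; they are the quantities an
# optimiser would search over in a real system.
steered = steer([0.4, 0.6, 0.0, 0.5], supportive, opposing, alpha=2.0, beta=0.0)
print(supportive, opposing, steered)
```

In this toy setting, only the two selected units change, which is the sense in which the edit is sparse; units with no activation gap between the groups are never selected.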
Abstract
Complex narrative contexts often challenge language models' ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To thoroughly evaluate our method, we introduce FreeInstruct, a diverse and realistic benchmark of 1,212 examples that highlights the challenges of instruction following in narrative-rich settings. While initially motivated by complex narratives, Concise-SAE demonstrates state-of-the-art instruction adherence across varied tasks without compromising generation quality.
