Probe-Free Low-Rank Activation Intervention
Chonghe Jiang, Bao Nguyen, Anthony Man-Cho So, Viet Anh Nguyen
TL;DR
This work introduces FLORAIN, a probe-free, low-rank activation intervention for editing transformer activations at inference time. FLORAIN models the region of desirable outputs as an ellipsoid and learns a nonlinear low-rank map that performs single-layer, head-wise interventions, so no probe classifier needs to be trained. The resulting objective is smooth and can be optimized efficiently because the projection onto the ellipsoid under the Mahalanobis distance has an analytical form. On TruthfulQA, FLORAIN improves truthfulness and informativeness across multiple LMs and is competitive with or superior to baselines such as ITI and few-shot prompting, while keeping distribution shift limited. The approach enables scalable, near real-time intervention, applies broadly to instruction-tuned models, and suggests future work on richer region models and alternative projection schemes.
Abstract
Language models (LMs) can produce text that appears accurate and coherent but contains untruthful or toxic content. Inference-time interventions that edit the hidden activations have shown promising results in steering LMs towards desirable generations. Existing activation intervention methods often rely on an activation probe that detects undesirable generation and triggers an activation modification to steer subsequent generation. This paper proposes FLORAIN, a probe-free intervention method applied to all attention heads in a specific activation layer, which eliminates the need to train classifiers for probing purposes. The intervention function is parametrized by a sample-wise nonlinear low-rank mapping, which is trained by minimizing the distance between the modified activations and their projection onto the manifold of desirable content. Under specific constructions of the manifold and projection distance, we show that the intervention strategy can be computed efficiently by solving a smooth optimization problem. The empirical results, benchmarked on multiple base models, demonstrate that FLORAIN consistently outperforms several baseline methods in enhancing model truthfulness and quality across generation and multiple-choice tasks.
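To illustrate why an ellipsoidal region paired with a Mahalanobis distance gives an analytical projection, the following is a minimal sketch rather than the authors' implementation. It assumes the desirable region is the set of activations whose Mahalanobis distance to a center is at most a radius, and that the projection distance uses the same matrix as the ellipsoid. The names `mu`, `sigma`, `rho`, and all function names are hypothetical and are not taken from the paper. Under these assumptions, the projection reduces to rescaling the activation radially toward the center, and the squared distance to the region is a simple hinge on the Mahalanobis distance.

```python
import numpy as np


def mahalanobis(x, mu, sigma):
    """Mahalanobis distance of x from mu under a positive definite matrix sigma."""
    diff = x - mu
    # Solve sigma * y = diff instead of forming the inverse explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(sigma, diff)))


def project_onto_ellipsoid(x, mu, sigma, rho):
    """Project x onto {z : mahalanobis(z, mu, sigma) <= rho}.

    Assumption: the projection distance is the Mahalanobis distance induced
    by the same sigma, so the nearest point lies on the segment from mu to x.
    """
    d = mahalanobis(x, mu, sigma)
    if d <= rho:
        return x.copy()  # already inside the region
    return mu + (rho / d) * (x - mu)


def projection_loss(x, mu, sigma, rho):
    """Squared Mahalanobis distance between x and its projection."""
    return max(0.0, mahalanobis(x, mu, sigma) - rho) ** 2


# Toy 2-D example with hypothetical values.
mu = np.zeros(2)
sigma = np.diag([4.0, 1.0])
x_outside = np.array([4.0, 0.0])
x_projected = project_onto_ellipsoid(x_outside, mu, sigma, rho=1.0)
```

In a training loop, a loss of this form would be evaluated on the edited activations and minimized over the parameters of the low-rank map; that map is not specified in this section, so it is left out of the sketch.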
