Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models
Gökdeniz Gülmez
TL;DR
Gabliteration advances neural weight modification by combining multi-directional refusal subspace extraction with ridge-regularized projections, adaptive layer selection, and layer-sensitive scaling to selectively modify harmful behaviors while preserving downstream performance. The approach extends prior abliteration by enabling partial, regularized projections onto a learned refusal subspace across selected layers, with theoretical guarantees on performance preservation and analyzed complexity. Empirical validation across 0.6B–32B models demonstrates effective behavioral alteration with minimal degradation to unrelated tasks, supported by ablation studies and comparisons to exact orthogonalization. The work also provides a comprehensive theoretical framework, including concentration bounds, error analyses, and optimality considerations for the proposed layer selection strategy. Overall, Gabliteration offers a scalable, principled methodology for safer, more controllable behavioral modification in large language models.
Abstract
We present Gabliteration, a novel neural weight modification technique that advances beyond traditional abliteration methods by implementing adaptive multi-directional projections with regularized layer selection. Our approach addresses a fundamental limitation of existing methods: they compromise model quality when modifying specific behavioral patterns. Through dynamic layer optimization, regularized projection matrices, and adaptive scaling mechanisms, we achieve theoretically superior weight modification while minimizing quality degradation in unrelated domains. We validate our method through the gabliterated-v1 model series (0.6B to 4B parameters) available on Hugging Face, demonstrating practical applicability across multiple model scales.
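To make the core idea concrete, the following is a minimal NumPy sketch of a multi-directional, ridge-regularized projection applied to a single weight matrix. It is an illustration under stated assumptions, not the released implementation: the function names (`refusal_subspace`, `ridge_projector`, `modify_weight`), the use of paired activation differences, and the default values of `lam` and `alpha` are all hypothetical. Adaptive layer selection and layer-sensitive scaling are omitted; here `alpha` is a single fixed scale.

```python
import numpy as np

def refusal_subspace(harmful_acts, harmless_acts, k):
    """Return k orthonormal directions (d, k) from activation differences.

    Assumption: both inputs have shape (n_samples, d) and are paired row by row.
    """
    diff = harmful_acts - harmless_acts
    # Rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    return vt[:k].T

def ridge_projector(R, lam):
    """P = R (R^T R + lam * I)^{-1} R^T.

    lam = 0 gives the exact orthogonal projector onto span(R);
    lam > 0 shrinks the projection, so the subspace is only partially removed.
    """
    k = R.shape[1]
    return R @ np.linalg.solve(R.T @ R + lam * np.eye(k), R.T)

def modify_weight(W, R, lam=0.1, alpha=1.0):
    """W' = W - alpha * P W for a weight W of shape (d, d_in).

    Assumption: W writes into the d-dimensional space in which R was extracted.
    """
    return W - alpha * ridge_projector(R, lam) @ W

# Toy usage with random data standing in for real hidden states.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(32, 16))
harmless = rng.normal(size=(32, 16))
R = refusal_subspace(harmful, harmless, k=2)
W_new = modify_weight(rng.normal(size=(16, 8)), R, lam=0.1, alpha=1.0)
```

With orthonormal directions, the regularized projector reduces to `R @ R.T / (1 + lam)`, so the component of `W` inside the subspace is scaled by `1 - alpha / (1 + lam)` while everything orthogonal to it is left unchanged. Setting `lam = 0` and `alpha = 1` recovers exact orthogonalization.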
