Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
TL;DR
The paper tackles unintended out-of-distribution (OOD) generalization in fine-tuned LLMs by introducing Concept Ablation Fine-Tuning (CAFT). CAFT uses interpretability tools to identify latent directions that correspond to undesired concepts, then ablates those directions with linear projections during fine-tuning, without modifying the training data. Directions are identified in two ways: PCA on activation differences and sparse autoencoders (SAEs). In the emergent misalignment setting, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. It also improves robustness to spurious correlations in two multiple-choice tasks, with substantial gains in OOD accuracy when interpreted latents are ablated, and the paper compares it against baselines that ablate random or top latents. The work suggests a practical route to steering LLM generalization during fine-tuning that requires no changes to the training data and no data from the target distribution, with implications for safer deployment and possible scalability to larger frontier models.
Abstract
Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM's latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.
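The ablation step described in the abstract, removing a set of concept directions from the latent space by linear projection, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `ablate_directions`, the array shapes, and the random data are assumptions made for illustration, and the sketch shows only the projection itself, not its application inside a model's forward pass during fine-tuning.

```python
import numpy as np


def ablate_directions(acts: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Project activations onto the orthogonal complement of span(directions).

    acts:       (n_tokens, d_model) activation vectors (hypothetical shape).
    directions: (k, d_model) concept directions, not necessarily orthonormal.
    """
    # Orthonormalize the directions so the projection stays correct even
    # when the directions are correlated with one another.
    q, _ = np.linalg.qr(directions.T)  # q has shape (d_model, k)
    # Subtract the component of each activation that lies in the subspace.
    return acts - (acts @ q) @ q.T


# Toy data standing in for real model activations and concept directions.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 16))
directions = rng.normal(size=(2, 16))

ablated = ablate_directions(acts, directions)
```

After the projection, the ablated activations have zero component along every supplied direction, and applying the projection a second time changes nothing. In a fine-tuning setup, the same operation would presumably be applied to intermediate activations on each forward pass so that gradients are computed with the concepts removed.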
