Understanding and Mitigating Political Stance Cross-topic Generalization in Large Language Models

Jiayi Zhang, Shu Yang, Junchao Wu, Derek F. Wong, Di Wang

TL;DR

This work uncovers neuron-level encodings of political stances in LLMs, revealing general political neurons that influence cross-topic ideology and topic-specific neurons that govern individual topics. It introduces PNLAC to locate these neurons via Activation Contrast and PAS, and demonstrates through activation patching that general neurons drive cross-topic shifts while topic-specific neurons control topic-specific stances. To mitigate undesired cross-topic generalization, the authors propose InhibitFT, which freezes general political neurons during fine-tuning and tunes only topic-specific neurons, achieving ~20% reduction in cross-topic coupling with minimal utility loss and showing effectiveness even when freezing as few as 5% of neurons. The approach provides a principled, interpretable method to control political stances in LLMs, with robust results across multiple models and datasets and practical implications for alignment and safety in open-ended generation tasks.
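The split into general and topic-specific neurons can be illustrated with a small sketch. It is not the authors' PNLAC implementation: the PAS score is not defined in this summary, so the absolute difference in mean activation between left-leaning and right-leaning prompts stands in for it. The top-k selection rule, the helper names (`contrast_scores`, `top_k`, `split_neurons`), and the toy activations are all assumptions made for illustration.

```python
from statistics import mean

def contrast_scores(left_acts, right_acts):
    """Per-neuron |mean activation on left prompts - mean on right prompts|.
    Stand-in score; the paper's PAS definition is not reproduced here."""
    n_neurons = len(left_acts[0])
    return [
        abs(mean(row[j] for row in left_acts) - mean(row[j] for row in right_acts))
        for j in range(n_neurons)
    ]

def top_k(scores, k):
    """Indices of the k highest-scoring neurons."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])

def split_neurons(acts_by_topic, k):
    """acts_by_topic maps topic -> (left_acts, right_acts), each a list of
    per-prompt activation vectors. Assumed rule: a neuron selected for every
    topic is 'general'; one selected for exactly one topic is 'topic-specific'."""
    selected = {
        topic: top_k(contrast_scores(left, right), k)
        for topic, (left, right) in acts_by_topic.items()
    }
    general = set.intersection(*selected.values())
    specific = {}
    for topic, chosen in selected.items():
        others = set().union(*(s for t, s in selected.items() if t != topic))
        specific[topic] = chosen - others
    return general, specific

# Toy activations for 4 neurons and 2 prompts per side (made-up numbers).
acts_by_topic = {
    "race": (
        [[1.0, 0.0, 0.5, 0.2], [0.8, 0.2, 0.5, 0.2]],   # left-leaning prompts
        [[0.0, 0.9, 0.5, 0.2], [0.2, 1.1, 0.5, 0.2]],   # right-leaning prompts
    ),
    "economy": (
        [[1.0, 0.3, 0.0, 0.2], [1.0, 0.3, 0.2, 0.2]],
        [[0.1, 0.3, 0.9, 0.2], [0.1, 0.3, 0.7, 0.2]],
    ),
}

general, specific = split_neurons(acts_by_topic, k=2)
print(general, specific)
```

In this toy setup neuron 0 separates left from right prompts on both topics, so it lands in the general set, while neurons 1 and 2 each matter for only one topic.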

Abstract

Fine-tuning Large Language Models on a political topic will significantly manipulate their political stance on various issues and unintentionally affect their stance on unrelated topics. While previous studies have raised this issue, there is still a lack of understanding regarding the internal representations of these stances and the mechanisms that lead to unintended cross-topic generalization. In this paper, we systematically explore the internal mechanisms underlying this phenomenon from a neuron-level perspective and how to mitigate the cross-topic generalization of political fine-tuning. First, we propose Political Neuron Localization through Activation Contrasting (PNLAC) to identify two distinct types of political neurons: general political neurons, which govern stance across multiple political topics, and topic-specific neurons, which affect the model's political stance on individual topics. We confirm the existence of these political neuron types across four models and datasets through activation patching experiments. Leveraging these insights, we introduce InhibitFT, an inhibition-based fine-tuning method that effectively mitigates cross-topic stance generalization. Experimental results demonstrate the robustness of the identified neuron types across various models and datasets, and show that InhibitFT reduces cross-topic stance generalization by 20% on average while preserving topic-specific performance. Moreover, we demonstrate that selectively inhibiting only 5% of neurons is sufficient to effectively mitigate cross-topic stance generalization.
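The inhibition idea behind InhibitFT, keeping general political neurons fixed while the rest of the model is fine-tuned, can be sketched as a masked gradient step. This is a minimal illustration under stated assumptions, not the authors' implementation: a "neuron" is modeled as one row of a weight matrix, the optimizer is plain SGD, and the function name `masked_sgd_step`, the weights, the gradients, and the frozen index set are all hypothetical.

```python
def masked_sgd_step(weights, grads, frozen, lr=0.1):
    """One SGD step on a weight matrix whose rows are treated as neurons.
    Rows whose index is in `frozen` are copied unchanged (inhibited);
    every other row receives the usual update w - lr * g."""
    return [
        row[:] if i in frozen
        else [w - lr * g for w, g in zip(row, grad_row)]
        for i, (row, grad_row) in enumerate(zip(weights, grads))
    ]

# Toy layer with 3 neurons and 2 inputs each (made-up numbers).
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
grads = [[1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]

# Hypothetical result of the localization step: neuron 0 is "general".
general_neurons = {0}

updated = masked_sgd_step(weights, grads, general_neurons, lr=0.1)
print(updated)
```

The sketch updates every row that is not frozen. Restricting training to topic-specific neurons only would amount to freezing every row outside that set, which the same function supports by passing a larger `frozen` set.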

Paper Structure

This paper contains 42 sections, 7 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Light fine-tuning can broadly shift an LLM's political stance. For example, as illustrated in this figure, fine-tuning a model with right-leaning prompts on the topic Race shifts the model's stance on the topic Economy from left to right. This susceptibility generalizes to unrelated topics.
  • Figure 2: Overview of our method. (a) Neurons are identified using PNLAC, which computes activation scores, and are divided into two types. (b) The identified neurons are verified. (c) InhibitFT: general political neurons are frozen during fine-tuning to mitigate cross-topic coupling.
  • Figure 3: Distribution of Political Neurons in Llama-3.1-8B.
  • Figure 4: Political stance of patched model (Llama-3.1-8B).
  • Figure 5: Political stance of the default right-leaning fine-tuned model, the InhibitFT model, and the randomly selected InhibitFT model on Llama-3.1-8B.
  • ...and 7 more figures