Steering LLMs for Culturally Localized Generation

Simran Khanuja; Hongbin Liu; Shujian Zhang; John Lambert; Mingqing Chen; Rajiv Mathews; Lun Wang

Abstract

LLMs are deployed globally, yet produce responses biased towards cultures with abundant training data. Existing cultural localization approaches such as prompting or post-training alignment are black-box, hard to control, and do not reveal whether failures reflect missing knowledge or poor elicitation. In this paper, we address these gaps using mechanistic interpretability to uncover and manipulate cultural representations in LLMs. Leveraging sparse autoencoders, we identify interpretable features that encode culturally salient information and aggregate them into Cultural Embeddings (CuE). We use CuE both to analyze implicit cultural biases when prompts are underspecified and to construct white-box steering interventions. Across multiple models, we show that CuE-based steering increases cultural faithfulness and elicits significantly rarer, long-tail cultural concepts than prompting alone. Notably, CuE-based steering is complementary to black-box localization methods, offering gains when applied on top of prompt-augmented inputs. This suggests that models benefit from better elicitation strategies and do not necessarily lack long-tail knowledge, though this varies across cultures. Our results provide both diagnostic insight into cultural representations in LLMs and a controllable method to steer towards desired cultures.

Paper Structure (45 sections, 21 equations, 12 figures, 9 tables)

Introduction
Methodology
Step 0: Can models classify assertions into cultures?
Step 1: Feature Activation Extraction
Step 2: Constructing CuE
Step 3a: CuE for Bias Analysis
Step 3b: CuE for Steering
Experimental Setup
Results and Analysis
RQ1: Do LLMs encode culture-specific signals, and where are they located?
RQ2: What concepts do shared vs. unique features encode?
RQ3: What cultural defaults do LLMs exhibit in underspecified prompts and how does steering/prompting change this?
RQ4: Can CuE steer model outputs toward target cultures?
RQ5: How do steering effects vary across cultures?
Related Work
...and 30 more sections

Figures (12)

Figure 1: We develop a framework for discovering and steering culture-specific features in LLMs. We extract SAE activations from cultural assertion data (Step 1), compute mutual information between feature activations and country labels to identify culturally salient dimensions, and aggregate them into country-level Cultural Embeddings (CuE) (Step 2). Using CuE, we analyze implicit cultural defaults under underspecified prompts by comparing generated responses to centered country prototypes (Step 3a). We then construct culture-specific steering vectors from CuE and intervene in the model’s residual stream to guide generation toward desired cultural targets with controllable strength (Step 3b). The steering strength enables dynamic control over output faithfulness, even when used with prompt-augmented inputs, as shown in the cultural faithfulness scale.
Figure 2: Heatmaps show cosine similarity between generated responses and country prototypes under Implicit, $\textsc{Steer}_{\textsc{Implicit}}$, Explicit, and $\textsc{Steer}_{\textsc{Explicit}}$ conditions. Implicit prompting concentrates alignment on Anglophone countries, while explicit prompting and steering progressively redistribute similarity mass toward target cultures and away from default cultural priors.
Figure 3: Mean cultural faithfulness and rarity scores (1-10 scale) across four settings: Implicit, $\textsc{Steer}_{\textsc{Implicit}}$, Explicit, and $\textsc{Steer}_{\textsc{Explicit}}$. Steered conditions consistently improve both metrics over their unsteered counterparts, with gains scaling with model capacity.
Figure 4: Pairwise win/tie/loss rates (%) comparing steered outputs against explicit prompting with an ensemble of judges. Green bars indicate steered wins; red bars indicate explicit prompting wins. Steering outperforms explicit prompting on both faithfulness and rarity across most configurations, with the strongest gains on Gemma-2-9B 16K and Llama-3.1-8B 32K.
Figure 5: Prompt template used to augment the CANDLE dataset with culturally specific but non-identifying assertions.
...and 7 more figures
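
The pipeline summarized in the Figure 1 caption (mutual-information feature selection, aggregation into CuE, comparison to centered country prototypes, and residual-stream steering) can be illustrated with a minimal NumPy sketch on toy data. This is not the authors' implementation. The excerpt does not specify the estimator or the aggregation, so several choices here are assumptions: binarizing activations at zero before computing mutual information, averaging selected features per country to form CuE, and mapping CuE back to the residual stream through a hypothetical SAE decoder matrix `W_dec`. The function names are illustrative as well.

```python
import numpy as np

def mutual_information(active, labels, n_classes):
    """MI (in nats) between a binary 'feature fired' indicator and country labels."""
    mi = 0.0
    for a in (False, True):
        p_a = np.mean(active == a)
        for c in range(n_classes):
            p_c = np.mean(labels == c)
            p_ac = np.mean((active == a) & (labels == c))
            if p_ac > 0:
                mi += p_ac * np.log(p_ac / (p_a * p_c))
    return mi

def build_cue(acts, labels, n_classes, top_k):
    """Keep the top_k features by MI, then average them per country (assumed aggregation)."""
    scores = np.array([mutual_information(acts[:, j] > 0, labels, n_classes)
                       for j in range(acts.shape[1])])
    selected = np.sort(np.argsort(scores)[::-1][:top_k])
    cue = np.stack([acts[labels == c][:, selected].mean(axis=0)
                    for c in range(n_classes)])
    return selected, cue

def centered_cosine(response_vec, cue):
    """Cosine similarity between a response and each mean-centered country prototype."""
    mean = cue.mean(axis=0)
    protos = cue - mean
    r = response_vec - mean
    denom = np.linalg.norm(protos, axis=1) * np.linalg.norm(r) + 1e-12
    return protos @ r / denom

def steer(residual, direction, alpha):
    """Add a unit-norm steering direction, scaled by strength alpha, to a residual vector."""
    unit = direction / (np.linalg.norm(direction) + 1e-12)
    return residual + alpha * unit

# Toy SAE activations: 6 assertions x 4 features, two countries (labels 0 and 1).
labels = np.array([0, 0, 0, 1, 1, 1])
acts = np.array([
    [1.0, 0.0, 1.0, 1.0],
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 1.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0, 0.0],
])

selected, cue = build_cue(acts, labels, n_classes=2, top_k=2)

# A response whose selected features resemble country 0.
sims = centered_cosine(np.array([1.5, 0.0]), cue)

# Hypothetical SAE decoder (n_features x d_model); random values for illustration only.
W_dec = np.random.default_rng(0).normal(size=(4, 3))
direction = (cue[1] - cue.mean(axis=0)) @ W_dec[selected]
residual = np.zeros(3)
steered = steer(residual, direction, alpha=2.0)
```

In this sketch `alpha` plays the role of the steering strength mentioned in the caption: because the direction is normalized, the residual vector moves by exactly `alpha` in norm, which makes the strength comparable across target countries.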
