Steering LLMs for Culturally Localized Generation

Simran Khanuja, Hongbin Liu, Shujian Zhang, John Lambert, Mingqing Chen, Rajiv Mathews, Lun Wang

Abstract

LLMs are deployed globally, yet produce responses biased towards cultures with abundant training data. Existing cultural localization approaches such as prompting or post-training alignment are black-box, hard to control, and do not reveal whether failures reflect missing knowledge or poor elicitation. In this paper, we address these gaps using mechanistic interpretability to uncover and manipulate cultural representations in LLMs. Leveraging sparse autoencoders, we identify interpretable features that encode culturally salient information and aggregate them into Cultural Embeddings (CuE). We use CuE both to analyze implicit cultural biases under underspecified prompts and to construct white-box steering interventions. Across multiple models, we show that CuE-based steering increases cultural faithfulness and elicits significantly rarer, long-tail cultural concepts than prompting alone. Notably, CuE-based steering is complementary to black-box localization methods, offering gains when applied on top of prompt-augmented inputs. This also suggests that models do benefit from better elicitation strategies, and don't necessarily lack long-tail knowledge representation, though this varies across cultures. Our results provide both diagnostic insight into cultural representations in LLMs and a controllable method to steer towards desired cultures.

Paper Structure (45 sections, 21 equations, 12 figures, 9 tables)

Figures (12)

  • Figure 1: We develop a framework for discovering and steering culture-specific features in LLMs. We extract SAE activations from cultural assertion data (Step 1), compute mutual information between feature activations and country labels to identify culturally salient dimensions, and aggregate them into country-level Cultural Embeddings (CuE) (Step 2). Using CuE, we analyze implicit cultural defaults under underspecified prompts by comparing generated responses to centered country prototypes (Step 3a). We then construct culture-specific steering vectors from CuE and intervene in the model’s residual stream to guide generation toward desired cultural targets with controllable strength (Step 3b). The steering strength enables dynamic control over output faithfulness, even when used with prompt-augmented inputs, as shown in the cultural faithfulness scale.
  • Figure 2: Heatmaps show cosine similarity between generated responses and country prototypes under Implicit, $\textsc{Steer}_{\textsc{Implicit}}$, Explicit, and $\textsc{Steer}_{\textsc{Explicit}}$ conditions. Implicit prompting concentrates alignment on Anglophone countries, while explicit prompting and steering progressively redistribute similarity mass toward target cultures and away from default cultural priors.
  • Figure 3: Mean cultural faithfulness and rarity scores (1–10 scale) across four settings: Implicit, $\textsc{Steer}_{\textsc{Implicit}}$, Explicit, and $\textsc{Steer}_{\textsc{Explicit}}$. Steered conditions consistently improve both metrics over their unsteered counterparts, with gains scaling with model capacity.
  • Figure 4: Pairwise win/tie/loss rates (%) comparing steered outputs against explicit prompting with an ensemble of judges. Green bars indicate steered wins; red bars indicate explicit prompting wins. Steering outperforms explicit prompting on both faithfulness and rarity across most configurations, with the strongest gains on Gemma-2-9B 16K and Llama-3.1-8B 32K.
  • Figure 5: Prompt template used to augment the CANDLE dataset with culturally specific but non-identifying assertions.
  • ...and 7 more figures
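
The pipeline summarized in the Figure 1 caption (mutual-information feature selection, aggregation into country-level embeddings, comparison against centered prototypes, and a residual-stream intervention) can be illustrated with a toy NumPy sketch. This is not the authors' implementation. The binarization of activations, the top-k cutoff, the projection through a decoder matrix, the unit-normalized direction, and every name below (`mutual_information`, `build_cue`, `steer`, `alpha`, the synthetic data) are assumptions made for illustration; the paper's exact choices may differ.

```python
import numpy as np

def mutual_information(active, labels, n_classes):
    """MI in nats between a binary 'feature fired' indicator and country labels."""
    mi = 0.0
    for a in (0, 1):
        p_a = np.mean(active == a)
        for c in range(n_classes):
            p_c = np.mean(labels == c)
            p_ac = np.mean((active == a) & (labels == c))
            if p_ac > 0:  # zero-probability cells contribute nothing
                mi += p_ac * np.log(p_ac / (p_a * p_c))
    return float(mi)

def build_cue(acts, labels, n_classes, top_k):
    """Step 2 (assumed form): keep the top_k features by MI, then average per country."""
    binary = (acts > 0).astype(int)  # assumption: a feature 'fires' when activation > 0
    scores = np.array([mutual_information(binary[:, j], labels, n_classes)
                       for j in range(acts.shape[1])])
    selected = np.argsort(scores)[::-1][:top_k]
    cue = np.stack([acts[labels == c][:, selected].mean(axis=0)
                    for c in range(n_classes)])
    return selected, cue

def cosine(u, v, eps=1e-8):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def steer(resid, cue_vec, selected, decoder, alpha):
    """Step 3b (assumed form): map a CuE vector to model space and add it with strength alpha."""
    direction = cue_vec @ decoder[selected]  # shape: (d_model,)
    direction = direction / (np.linalg.norm(direction) + 1e-8)
    return resid + alpha * direction

# Synthetic stand-in for SAE activations: feature c fires only for country c,
# while features 3..5 fire at random regardless of country.
rng = np.random.default_rng(0)
n_countries, per_country, n_features, d_model = 3, 100, 6, 8
labels = np.repeat(np.arange(n_countries), per_country)
acts = np.zeros((labels.size, n_features))
for c in range(n_countries):
    acts[labels == c, c] = 1.0 + rng.random(per_country)
noise = rng.random((labels.size, 3))
acts[:, 3:] = np.where(noise > 0.5, noise, 0.0)

selected, cue = build_cue(acts, labels, n_countries, top_k=3)

# Step 3a (assumed form): center the country embeddings and score one response against them.
center = cue.mean(axis=0)
prototypes = cue - center
response = acts[labels == 1][0, selected] - center
sims = [cosine(response, p) for p in prototypes]

# Hypothetical decoder matrix (n_features x d_model) and an all-zero residual vector.
decoder = rng.standard_normal((n_features, d_model))
resid = np.zeros(d_model)
steered = steer(resid, prototypes[1], selected, decoder, alpha=4.0)
```

In this toy setup the country-specific features carry far more mutual information than the random ones, so they are the ones selected, and `alpha` directly sets the size of the shift applied to the residual vector, which mirrors the controllable steering strength the caption describes.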