In-Context Impersonation Reveals Large Language Models' Strengths and Biases

Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, Zeynep Akata

TL;DR

The paper probes in-context impersonation by instructing LLMs to adopt social and domain-specific personas, and evaluates the effects across a bandit task, reasoning benchmarks, and vision-language classification. Using zero-shot prompting with persona prefixes, the study reveals human-like developmental patterns in exploration and domain-driven improvements in reasoning, while also uncovering race- and gender-associated biases in the descriptions used for visual classification. The approach combines age-based, expertise-based, and demographic impersonations to reveal both strengths and biases of contemporary language models, and demonstrates how persona-generated text can influence downstream multimodal tasks. These findings inform both the potential practical benefits and the societal risks of persona-driven prompts, suggesting careful bias testing and mitigation as models scale and integrate into real-world systems.

Abstract

In everyday conversations, humans can take on different roles and adapt their vocabulary to their chosen roles. We explore whether LLMs can take on, that is impersonate, different roles when they generate text in-context. We ask LLMs to assume different personas before solving vision and language tasks. We do this by prefixing the prompt with a persona that is associated either with a social identity or domain expertise. In a multi-armed bandit task, we find that LLMs pretending to be children of different ages recover human-like developmental stages of exploration. In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts. Finally, we test whether LLMs' impersonations are complementary to visual information when describing different categories. We find that impersonation can improve performance: an LLM prompted to be a bird expert describes birds better than one prompted to be a car expert. However, impersonation can also uncover LLMs' biases: an LLM prompted to be a man describes cars better than one prompted to be a woman. These findings demonstrate that LLMs are capable of taking on diverse roles and that this in-context impersonation can be used to uncover their hidden strengths and biases.
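
The persona-prefixing step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_persona_prompt` and the wording of the prefix template are assumptions, since the exact prompt text is not given here.

```python
def build_persona_prompt(persona: str, task: str) -> str:
    """Prefix a task prompt with a persona instruction.

    The template "Imagine you are ..." is a hypothetical stand-in;
    the source only states that the prompt is prefixed with a persona.
    """
    persona = persona.strip()
    if not persona:
        # Neutral baseline: no persona prefix, the task is sent unchanged.
        return task
    return f"Imagine you are {persona}. {task}"


# Social-identity and domain-expertise personas share the same mechanism.
prompts = [
    build_persona_prompt(p, "Describe a cardinal.")
    for p in ("a 4 year old", "a bird expert", "a car expert", "")
]
for prompt in prompts:
    print(prompt)
```

Keeping the task text fixed and varying only the prefix means any difference in the model's output can be attributed to the persona.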

Paper Structure

This paper contains 33 sections, 4 equations, 19 figures.

Figures (19)

  • Figure 1: Our three tasks are designed to analyze the effect of in-context impersonation. First, we investigate bandit tasks (pink) where the LLM must maximize the reward while impersonating different age groups. Second, we evaluate the effect of domain expert impersonation on natural language reasoning tasks (yellow). Third, we study the usefulness of descriptions generated with impersonation w.r.t. age, expertise, ethnicity, and gender for visual classification (green).
  • Figure 2: Two-armed bandit task. Top: average reward per persona (10k games of 10 trials). Left: age and number of trials have a positive effect on the expected reward. Right: with age, exploration decreases and exploitation increases.
  • Figure 3: Expertise-based impersonation on all domains of the MMLU reasoning benchmark (top) and on exemplary individual tasks (bottom). For each task, we consider four personas: the neutral, the task expert, the domain experts (all experts from the same domain except the task expert) and the non-domain experts (all experts from all remaining domains). The dashed line is the random baseline.
  • Figure 4: Comparing CLIP-32, CLIP-16 and OpenCLIP as VLMs (the language input comes from Vicuna-13B) on CUB (top) and Stanford Cars (bottom) datasets. We observe the effects of age, expertise, ethnicity and gender independent of the VLM used for fine-grained visual classification. The dashed line represents the random baseline.
  • Figure 5: Comparing Vicuna-13B and ChatGPT as LLM variants (OpenCLIP is the VLM) on CUB and Stanford Cars. For both LLMs, accuracy increases with age and the expert persona for the respective dataset performs better. Neither LLM is free of bias: impersonating different genders or races affects performance. The dashed line represents the random baseline.
  • ...and 14 more figures
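
The four persona groups named in the Figure 3 caption can be made concrete with a short sketch. It assumes a mapping from task to domain; the `TASK_DOMAIN` table below is a small illustrative subset chosen for this example, and the helper name `persona_groups` is hypothetical.

```python
from typing import Dict, List

# Illustrative task -> domain mapping (an assumption for this sketch,
# not the full benchmark taxonomy).
TASK_DOMAIN: Dict[str, str] = {
    "high school physics": "STEM",
    "college chemistry": "STEM",
    "world religions": "Humanities",
    "philosophy": "Humanities",
    "sociology": "Social Sciences",
}


def persona_groups(task: str, task_domain: Dict[str, str] = TASK_DOMAIN) -> Dict[str, List[str]]:
    """Split expert personas into the four groups used for one task."""
    domain = task_domain[task]
    return {
        # Neutral: no expertise is impersonated.
        "neutral": [""],
        # Task expert: the expert matching the task itself.
        "task expert": [f"{task} expert"],
        # Domain experts: same domain, excluding the task expert.
        "domain experts": [
            f"{t} expert" for t, d in task_domain.items() if d == domain and t != task
        ],
        # Non-domain experts: experts from all remaining domains.
        "non-domain experts": [
            f"{t} expert" for t, d in task_domain.items() if d != domain
        ],
    }


groups = persona_groups("high school physics")
for name, personas in groups.items():
    print(name, personas)
```

Under this grouping every expert persona falls into exactly one of the three expert groups for a given task, so the groups can be compared against each other and against the neutral persona.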