Feedback Forensics: A Toolkit to Measure AI Personality

Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Robert Mullins

TL;DR

This work tackles the challenge of evaluating AI personality, proposing Feedback Forensics, an open-source toolkit that explicitly measures personality traits inferred from human feedback and model behavior. It relies on a pairwise-response data paradigm with annotations from humans, a target model, and AI annotators, and computes metrics such as relevance, Cohen's kappa, and a strength score to quantify trait alignment. The authors demonstrate the framework across datasets (Chatbot Arena, MultiPref, PRISM) and multiple model families, revealing how feedback shapes personality and how models differ in trait expression. The toolkit, accompanying web app, and annotated data enable reproducible, fine-grained analysis of AI personality and offer a path toward more desirable, controllable model behavior in practice.

Abstract

Some traits that make a "good" AI model are hard to describe upfront. For example, should responses be more polite or more casual? Such traits are sometimes summarized as model character or personality. Without a clear objective, conventional benchmarks based on automatic validation struggle to measure such traits. Evaluation methods using human feedback, such as Chatbot Arena, have emerged as a popular alternative. These methods infer "better" personality and other desirable traits implicitly by ranking multiple model responses relative to each other. Recent issues with model releases highlight limitations of these existing opaque evaluation approaches: a major model was rolled back over sycophantic personality issues, and models were observed overfitting to such feedback-based leaderboards. Despite these known issues, limited public tooling exists to explicitly evaluate model personality. We introduce Feedback Forensics: an open-source toolkit to track AI personality changes, both those encouraged by human (or AI) feedback and those exhibited across AI models trained and evaluated on such feedback. Leveraging AI annotators, our toolkit enables investigating personality via a Python API and a browser app. We demonstrate the toolkit's usefulness in two steps: (A) first we analyse the personality traits encouraged in popular human feedback datasets including Chatbot Arena, MultiPref and PRISM; and (B) then we use our toolkit to analyse how much popular models exhibit such traits. We release (1) our Feedback Forensics toolkit alongside (2) a web app tracking AI personality in popular models and feedback datasets as well as (3) the underlying annotation data at https://github.com/rdnfn/feedback-forensics.

Paper Structure

This paper contains 46 sections, 21 figures, 2 tables.

Figures (21)

  • Figure 1: Overview of our Feedback Forensics toolkit.
  • Figure 2: Example of model personality differences. All models decipher the HTTP acronym correctly but the manner or personality of their responses varies. The ChatGPT version of GPT-4o uses more bold and emojis than the standard API version. The Gemini model is more verbose and uses different formatting than the GPT models. Standard benchmarks fail to identify these differences in models' personalities -- Feedback Forensics can quantify them.
  • Figure 3: Illustration of Feedback Forensics' method to measure personality traits. We take pairwise model response data as input, where each datapoint consists of a prompt (yellow) and two corresponding model responses (white). Optionally, additional metadata may be included (e.g. the generating model for each response). In Step 1, we add annotations to each datapoint, each selecting response A, response B, both, or neither. To understand personality traits encouraged by human preferences, we include a (1) human annotation (green) selecting the human-preferred response. Such annotations can be imported from external sources (e.g. Chatbot Arena) alongside the pairwise model response data. To understand the personality traits exhibited by a target model (e.g. a Claude model), we add a (2) target model annotation (red) using hard-coded rules on response metadata to select the response generated by the model (if available). Finally, using AI annotators, we add (3) personality annotations (blue) that select the response that exhibits a trait more (e.g. that is more confident). We collect one such annotation per datapoint and tested trait. In Step 2, we compare these annotations to compute personality metrics. To understand how much a specific personality trait is encouraged by human feedback (Result A), we compare human annotations (green) to personality annotations (blue) for that trait. High agreement (measured via the strength metric, see the section on metrics) indicates that the trait (or a highly correlated trait) is encouraged by human feedback. Low agreement indicates that the trait is discouraged. Similarly, to observe how much a target model exhibits a certain trait (Result B), we compare target model annotations (red) to that trait's personality annotations (blue). High agreement indicates that the trait uniquely identifies the model (relative to other models in the dataset), i.e. the model exhibits the trait more than other models. Low agreement indicates the model exhibits the trait less than other models.
  • Figure 4: Interpretation of strength metric in both use-cases. At the top, interpretation of strength metric when comparing human feedback and personality trait annotations of a specific trait (Result A). At the bottom, interpretation of strength metric when comparing target model and personality trait annotations of a specific trait (Result B). Colour here indicates the sign and magnitude of the strength metric rather than annotation type.
  • Figure 5: Most encouraged (blue) and discouraged (red) personality traits in Chatbot Arena. We observe a strong emphasis on encouraging better structured, more verbose and more confident responses. On the other hand, more concise or avoidant responses are discouraged. All measurements using strength metric.
  • ...and 16 more figures
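
The annotation comparison described in the Figure 3 caption can be illustrated with a small Python sketch. It does not use the Feedback Forensics API. The function names, the label values ("A", "B", "both", "neither"), and the toy data are all hypothetical. Cohen's kappa is computed in its standard form. The definitions of relevance (the share of datapoints where the personality annotation picks exactly one response) and of strength (kappa on those datapoints multiplied by relevance) are assumptions made for illustration, since this page names the three metrics without defining them.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Standard Cohen's kappa between two equal-length label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of datapoints with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: both annotators always use one label
    return (observed - expected) / (1.0 - expected)


def trait_metrics(reference, trait):
    """Compare reference annotations (human or target model) to trait annotations.

    ASSUMPTION: a datapoint is "relevant" when the trait annotation selects
    exactly one response, and strength = kappa (on relevant points) * relevance.
    """
    relevant = [i for i, t in enumerate(trait) if t in ("A", "B")]
    relevance = len(relevant) / len(trait)
    if not relevant:
        return {"relevance": 0.0, "kappa": 0.0, "strength": 0.0}
    kappa = cohen_kappa([reference[i] for i in relevant],
                        [trait[i] for i in relevant])
    return {"relevance": relevance, "kappa": kappa,
            "strength": kappa * relevance}


# Hypothetical toy data: six pairwise comparisons.
human_annotations = ["A", "B", "A", "A", "B", "A"]        # human-preferred response
trait_annotations = ["A", "B", "B", "A", "neither", "A"]  # e.g. "more confident"

metrics = trait_metrics(human_annotations, trait_annotations)
print(metrics)
```

Under these assumptions, a positive strength suggests that the trait tends to coincide with the reference annotation (encouraged by human feedback, or exhibited more by the target model), and a negative strength suggests the opposite. Passing target model annotations instead of human annotations as `reference` gives the Result B style of comparison.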