Table of Contents
Fetching ...

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

Dipika Khullar, Jack Hopkins, Rowan Wang, Fabien Roger

TL;DR

Across four coding and tool-use datasets, it is found that monitors fail to report high-risk or low-correctness actions more often when evaluation follows a previous assistant turn in which the action was generated, compared to when the same action is evaluated in a new context presented in a user turn.

Abstract

Agentic systems increasingly rely on language models to monitor their own behavior. For example, coding agents may self critique generated code for pull request approval or assess the safety of tool-use actions. We show that this design pattern can fail when the action is presented in a previous or in the same assistant turn instead of being presented by the user in a user turn. We define self-attribution bias as the tendency of a model to evaluate an action as more correct or less risky when the action is implicitly framed as its own, compared to when the same action is evaluated under off-policy attribution. Across four coding and tool-use datasets, we find that monitors fail to report high-risk or low-correctness actions more often when evaluation follows a previous assistant turn in which the action was generated, compared to when the same action is evaluated in a new context presented in a user turn. In contrast, explicitly stating that the action comes from the monitor does not by itself induce self-attribution bias. Because monitors are often evaluated on fixed examples rather than on their own generated actions, these evaluations can make monitors appear more reliable than they actually are in deployment, leading developers to unknowingly deploy inadequate monitors in agentic systems.

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

TL;DR

Across four coding and tool-use datasets, it is found that monitors fail to report high-risk or low-correctness actions more often when evaluation follows a previous assistant turn in which the action was generated, compared to when the same action is evaluated in a new context presented in a user turn.

Abstract

Agentic systems increasingly rely on language models to monitor their own behavior. For example, coding agents may self critique generated code for pull request approval or assess the safety of tool-use actions. We show that this design pattern can fail when the action is presented in a previous or in the same assistant turn instead of being presented by the user in a user turn. We define self-attribution bias as the tendency of a model to evaluate an action as more correct or less risky when the action is implicitly framed as its own, compared to when the same action is evaluated under off-policy attribution. Across four coding and tool-use datasets, we find that monitors fail to report high-risk or low-correctness actions more often when evaluation follows a previous assistant turn in which the action was generated, compared to when the same action is evaluated in a new context presented in a user turn. In contrast, explicitly stating that the action comes from the monitor does not by itself induce self-attribution bias. Because monitors are often evaluated on fixed examples rather than on their own generated actions, these evaluations can make monitors appear more reliable than they actually are in deployment, leading developers to unknowingly deploy inadequate monitors in agentic systems.
Paper Structure (55 sections, 1 equation, 17 figures, 1 table)

This paper contains 55 sections, 1 equation, 17 figures, 1 table.

Figures (17)

  • Figure 1: Illustration of self-attribution bias in a computer use setting. When asked to evaluate the risk of an action in a fresh context (left), models usually correctly assign a high risk score. But when instead we ask for a risk score as a follow-up question (after the risky action has been taken) (right), models often rate the action they just took as less risky than in the baseline setting, despite the action being judge being the same. This effect is strongest when the model rating the action is actually the one that generated the action (instead of the action having been generated by e.g. another model).
  • Figure 2: Code Harmfulness (higher = safer)
  • Figure 3: Code PR Correctness
  • Figure 4: PR Approval Recommendation Rate
  • Figure 6: Self-attribution selectively inflates ratings for incorrect code solutions. We plot ratings of code generated by GPT-5 under two evaluation conditions: on-policy, where GPT-5 rates its own code, and off-policy, where Claude-Sonnet-4 rates GPT-5's code. We fit a Gaussian to the each cluster of points (described by the legend) by computing covariance matrices described in the legend, and show the $1.5\sigma$ Gaussian contours (for legibility) as well as the mean ratings. We find that the cluster of on-policy incorrect ratings is shifted up relative to the off-policy one, which means baseline ratings stay mostly unchanged while self-attributed ratings are inflated. The inflation for correct points is not as big, which results in a worse classifier discrimination on-policy than off-policy.
  • ...and 12 more figures