Is General-Purpose AI Reasoning Sensitive to Data-Induced Cognitive Biases? Dynamic Benchmarking on Typical Software Engineering Dilemmas

Francesco Sovrano, Gabriele Dominici, Rita Sevastjanova, Alessandra Stramiglio, Alberto Bacchelli

TL;DR

This work introduces PROBE-SWE, a dynamic benchmarking framework for measuring data-induced cognitive biases in general-purpose AI (GPAI) systems that solve software-engineering dilemmas. It converts bias-embedded natural-language tasks into Prolog programs with a rule-based axiomatic background, enabling exact checks of correctness and bias-driven reasoning depth. The authors generate thousands of bias-annotated dilemmas from a small seed corpus and evaluate GPT, LLaMA, and DeepSeek, finding consistent bias sensitivity that grows with problem complexity. The protocol emphasizes robustness and transferability, offering a scalable, domain-portable approach to debiasing GPAI in engineering workflows.

Abstract

Human cognitive biases in software engineering can lead to costly errors. While general-purpose AI (GPAI) systems may help mitigate these biases due to their non-human nature, their training on human-generated data raises a critical question: Do GPAI systems themselves exhibit cognitive biases? To investigate this, we present the first dynamic benchmarking framework to evaluate data-induced cognitive biases in GPAI within software engineering workflows. Starting with a seed set of 16 hand-crafted realistic tasks, each featuring one of 8 cognitive biases (e.g., anchoring, framing) and corresponding unbiased variants, we test whether bias-inducing linguistic cues unrelated to task logic can lead GPAI systems from correct to incorrect conclusions. To scale the benchmark and ensure realism, we develop an on-demand augmentation pipeline relying on GPAI systems to generate task variants that preserve bias-inducing cues while varying surface details. This pipeline ensures correctness (88-99% on average, according to human evaluation), promotes diversity, and controls reasoning complexity by leveraging Prolog-based reasoning. We evaluate leading GPAI systems (GPT, LLaMA, DeepSeek) and find a consistent tendency to rely on shallow linguistic heuristics over more complex reasoning. All systems exhibit bias sensitivity (6-35%), which increases with task complexity (up to 49%) and highlights risks in AI-driven software engineering.
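The abstract reports bias sensitivity as a percentage but does not spell out the formula. As a minimal sketch, assuming sensitivity means the share of dilemma pairs that a system answers correctly in the unbiased variant but incorrectly once the bias-inducing cue is present, the computation could look like the following. The `PairedResult` record and the `bias_sensitivity` function are hypothetical names introduced for illustration, and the paper's exact definition may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """Outcome for one dilemma pair (hypothetical record layout)."""
    unbiased_correct: bool  # answer to the unbiased variant was correct
    biased_correct: bool    # answer to the bias-cued variant was correct


def bias_sensitivity(results: list[PairedResult]) -> float:
    """Share of pairs answered correctly without the cue but incorrectly with it.

    Assumption: only pairs solved correctly in the unbiased variant count,
    because only there can a cue flip a correct conclusion to an incorrect one.
    """
    eligible = [r for r in results if r.unbiased_correct]
    if not eligible:
        return 0.0
    flipped = sum(1 for r in eligible if not r.biased_correct)
    return flipped / len(eligible)


# Illustrative data only, not results from the paper.
example = [
    PairedResult(True, True),
    PairedResult(True, False),
    PairedResult(True, True),
    PairedResult(False, False),
    PairedResult(True, True),
]
print(f"bias sensitivity: {bias_sensitivity(example):.0%}")
```

Conditioning on the unbiased variant being solved keeps plain task difficulty from being counted as bias; an alternative convention would divide by all pairs instead.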

Paper Structure

This paper contains 94 sections, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Example of biased/unbiased dilemma.
  • Figure 2: Dynamic benchmarking protocol overview.
  • Figure 3: Bias sensitivity across six evaluated GPAI systems and three benchmark generation models.
  • Figure 4: Raw percent agreement $P_o$ by generator and task.
  • Figure 5: Comparison of majority-vote and unanimity pass rates by generator and task.
  • ...and 12 more figures
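The captions of Figures 4 and 5 name three human-evaluation statistics: raw percent agreement $P_o$, majority-vote pass rate, and unanimity pass rate. How the paper aggregates rater judgements is not stated here, so the sketch below is only one plausible reading: each item receives a pass/fail judgement from several raters, $P_o$ is the share of agreeing rater pairs averaged over items, and the two pass rates use a strict majority and full consensus. All function names and the sample data are hypothetical.

```python
from itertools import combinations


def percent_agreement(ratings: list[list[bool]]) -> float:
    """Raw agreement P_o: share of agreeing rater pairs, averaged over items."""
    per_item = []
    for item in ratings:
        pairs = list(combinations(item, 2))
        if not pairs:
            continue  # fewer than two ratings: no agreement to measure
        per_item.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_item) / len(per_item) if per_item else 0.0


def majority_pass_rate(ratings: list[list[bool]]) -> float:
    """Share of items that more than half of the raters marked as passing."""
    if not ratings:
        return 0.0
    return sum(2 * sum(item) > len(item) for item in ratings) / len(ratings)


def unanimity_pass_rate(ratings: list[list[bool]]) -> float:
    """Share of items that every rater marked as passing."""
    if not ratings:
        return 0.0
    return sum(all(item) for item in ratings) / len(ratings)


# Illustrative data only: four items, three raters each.
ratings = [
    [True, True, True],
    [True, True, False],
    [True, False, False],
    [True, True, True],
]
print(f"P_o       = {percent_agreement(ratings):.3f}")
print(f"majority  = {majority_pass_rate(ratings):.2f}")
print(f"unanimity = {unanimity_pass_rate(ratings):.2f}")
```

Under these assumptions the unanimity pass rate can never exceed the majority-vote pass rate, which is why comparing the two shows how much of the measured correctness depends on raters agreeing.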