BiasBusters: Uncovering and Mitigating Tool Selection Bias in Large Language Models

Thierry Blankenstein, Jialin Yu, Zixuan Li, Vassilis Plachouras, Sunando Sengupta, Philip Torr, Yarin Gal, Alasdair Paren, Adel Bibi

TL;DR

This work tackles tool-selection bias in tool-augmented LLMs, where models may favor one of several functionally equivalent APIs because of superficial cues rather than utility. It introduces a benchmark of 10 clusters of interchangeable APIs and total-variation-based metrics to quantify bias across seven models, revealing that semantic alignment between queries and tool descriptions is the primary predictor of selection and that metadata perturbations and biased pre-training amplify bias. The authors dissect the origins of bias through attribute correlations, metadata perturbations, and biased continued pre-training (CPT), showing measurable effects from description content, while CPT amplifies bias but is not its sole driver. To counteract bias, they propose a lightweight debiasing pipeline that filters candidates to a relevant subset and selects uniformly from that subset, which substantially reduces bias metrics while preserving task coverage. The findings underscore a practical need for fair, reliable tool-calling in LLM systems and provide readily adoptable resources and methods to improve marketplace fairness and user experience.

Abstract

Agents backed by large language models (LLMs) often rely on external tools drawn from marketplaces where multiple providers offer functionally equivalent options. This raises a critical point concerning fairness: if selection is systematically biased, it can degrade user experience and distort competition by privileging some providers over others. We introduce a benchmark of diverse tool categories, each containing multiple functionally equivalent tools, to evaluate tool-selection bias. Using this benchmark, we test seven models and show that unfairness exists with models either fixating on a single provider or disproportionately preferring earlier-listed tools in context. To investigate the origins of this bias, we conduct controlled experiments examining tool features, metadata (name, description, parameters), and pre-training exposure. We find that: (1) semantic alignment between queries and metadata is the strongest predictor of choice; (2) perturbing descriptions significantly shifts selections; and (3) repeated pre-training exposure to a single endpoint amplifies bias. Finally, we propose a lightweight mitigation that first filters the candidate tools to a relevant subset and then samples uniformly, reducing bias while preserving good task coverage. Our findings highlight tool-selection bias as a key obstacle for the fair deployment of tool-augmented LLMs.
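
The mitigation described in the abstract (filter the candidates to a relevant subset, then sample uniformly) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tool dictionaries, and in particular the keyword-overlap relevance filter are hypothetical stand-ins, and the paper's actual filtering step may work quite differently.

```python
import random
from typing import Callable, Optional, Sequence


def filter_then_sample(
    query: str,
    tools: Sequence[dict],
    is_relevant: Callable[[str, dict], bool],
    rng: Optional[random.Random] = None,
) -> Optional[dict]:
    """Keep the tools judged relevant to the query, then pick one uniformly."""
    rng = rng or random.Random()
    relevant = [tool for tool in tools if is_relevant(query, tool)]
    if not relevant:
        return None  # no candidate passed the filter
    return rng.choice(relevant)  # uniform over the relevant subset


def keyword_relevance(query: str, tool: dict) -> bool:
    """Hypothetical filter: the query shares at least one word with the description."""
    query_words = set(query.lower().split())
    description_words = set(tool["description"].lower().split())
    return bool(query_words & description_words)


# Hypothetical tool list: two interchangeable weather APIs and one unrelated API.
tools = [
    {"name": "weather_a", "description": "Current weather by city"},
    {"name": "weather_b", "description": "Weather forecast lookup"},
    {"name": "stocks_a", "description": "Stock price quotes"},
]

rng = random.Random(0)
picks = [
    filter_then_sample("weather in Paris", tools, keyword_relevance, rng)["name"]
    for _ in range(1000)
]
```

Because the final choice is a uniform draw, neither list position nor description wording can tilt the selection among tools that pass the filter; any remaining bias would have to come from the filter itself.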

Paper Structure

This paper contains 38 sections, 1 equation, 18 figures, 7 tables, 1 algorithm.

Figures (18)

  • Figure 1: Tool-calling enables LLMs to act through external services, but the selection process introduces bias. Models may favor certain tools based on superficial metadata or position rather than relevance (here "weatherapi_com" is preferred), leading to a potentially degraded user experience and an unfair concentration of calls. If such biases are systematic across frontier LLMs, they risk distorting entire tool marketplaces, disadvantaging functionally equivalent competitors.
  • Figure 2: Cyclic rotations of one fixed tool list; each API appears at the top once.
  • Figure 3: Selection distributions for six LLMs across three clusters of functionally equivalent APIs. Each subplot corresponds to one cluster, with the x-axis indicating the API in the cluster and the y-axis showing the (mean) fraction of times each model chose that API over 500 inference runs; error bars indicate the standard deviation across three independent experimental runs. The optimal uniform selection rate is highlighted.
  • Figure 4: API- vs. positional bias by model for three clusters. Bars show total-variation deviation from uniform, where higher values indicate stronger bias.
  • Figure 5: Mean total-variation (TV) distance from the base selection distribution (no perturbation) to the distribution under each metadata perturbation (higher = larger shift). Blue bars show results for Gemini and orange bars for ChatGPT. The error bars denote standard deviation across clusters. The run-to-run standard deviation is omitted; the maximum run-to-run variability of the per-run mean TV was $0.084$.
  • ...and 13 more figures
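
Two ideas from the captions above, the cyclic rotations of Figure 2 and the total-variation deviation from uniform of Figure 4, can be illustrated with a short Python sketch. The helper names are hypothetical, and the formula is the standard total-variation distance to the uniform distribution; the paper's exact metric (Definition 3.1) is not reproduced on this page, so treat this as an assumption about its general form.

```python
from typing import List, Sequence


def cyclic_rotations(tools: Sequence[str]) -> List[List[str]]:
    """All cyclic rotations of a tool list; each tool appears first exactly once."""
    items = list(tools)
    return [items[i:] + items[:i] for i in range(len(items))]


def tv_from_uniform(counts: Sequence[int]) -> float:
    """Total-variation distance between empirical selection rates and uniform.

    Returns 0.0 for perfectly even selection and approaches 1 - 1/k
    when a single one of the k options receives every selection.
    """
    total = sum(counts)
    k = len(counts)
    return 0.5 * sum(abs(count / total - 1.0 / k) for count in counts)


rotations = cyclic_rotations(["api_a", "api_b", "api_c"])
even = tv_from_uniform([25, 25, 25, 25])   # uniform selection
fixated = tv_from_uniform([100, 0, 0, 0])  # one API always chosen
```

Applying the same function to counts indexed by API gives an API-level deviation, and to counts indexed by list position gives a positional one; averaging over the rotations is what lets the two be separated.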

Theorems & Definitions (1)

  • Definition 3.1