BTC-SAM: Leveraging LLMs for Generation of Bias Test Cases for Sentiment Analysis Models

Zsolt T. Kardkovacs, Lynda Djennane, Anna Field, Boualem Benatallah, Yacine Gaci, Fabio Casati, Walid Gaaloul

TL;DR

BTC-SAM addresses social bias in sentiment analysis (SA) models by using few-shot prompts with large language models to generate diverse, naturalistic bias test cases from a minimal specification. The framework has three steps: Bias Test Specification, Example Test Sentence Generation, and Counterfactual Sentence Pair Generation. Three diversity augmentations (Lexical, Syntactic, and Semantic) then expand coverage. Effectiveness is measured with a formal bias-detection metric, $\frac{1}{|M||T|}\sum_{m\in M}\sum_{t\in T} F(t,m)$, where $F(t,m)=1$ if the counterfactual pair in test case $t$ elicits different outputs from model $m$. On this metric, BTC-SAM is competitive with EEC, CrowS-Pairs, and BiasTestGPT in bias discovery while yielding greater lexical and syntactic diversity. The study also shows that adding diversity through paraphrasing improves the ability to uncover unseen biases. One limitation is that the approach relies on LLMs that carry their own biases, so manual validation is still required. Overall, BTC-SAM reduces human effort in bias testing, expands linguistic coverage, and generalizes to previously unseen bias types, with implications for safer, fairer SA systems and beyond.
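
The bias-detection metric above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy keyword classifiers, and the example sentences are all hypothetical. It also assumes that "different model outputs" means unequal predicted labels and that $F(t,m)=0$ whenever the two outputs match.

```python
from typing import Callable, Sequence, Tuple

# Assumption: a test case t is a (sentence, counterfactual) pair and a
# model m maps a sentence to a sentiment label.
Pair = Tuple[str, str]
Model = Callable[[str], str]


def flags_bias(pair: Pair, model: Model) -> int:
    """F(t, m): 1 if the model labels the two sentences differently, else 0."""
    original, counterfactual = pair
    return int(model(original) != model(counterfactual))


def bias_detection_rate(models: Sequence[Model], pairs: Sequence[Pair]) -> float:
    """Average F(t, m) over every model m in M and every test case t in T."""
    if not models or not pairs:
        raise ValueError("need at least one model and one test case")
    flagged = sum(flags_bias(t, m) for m in models for t in pairs)
    return flagged / (len(models) * len(pairs))


# Hypothetical toy classifiers standing in for real SA models.
def neutral_model(sentence: str) -> str:
    return "positive" if "great" in sentence.lower() else "negative"


def skewed_model(sentence: str) -> str:
    # Deliberately biased: any mention of "grandmother" forces a negative label.
    if "grandmother" in sentence.lower():
        return "negative"
    return neutral_model(sentence)


# Hypothetical counterfactual pairs that differ only in the identity term.
pairs = [
    ("My brother cooked a great dinner.", "My grandmother cooked a great dinner."),
    ("My brother missed the bus.", "My grandmother missed the bus."),
]

rate = bias_detection_rate([neutral_model, skewed_model], pairs)
print(rate)  # 0.25: one of the four (model, test case) combinations is flagged
```

Only the skewed classifier changes its label on the first pair, so one of the four (model, test case) combinations is flagged. A test suite with more varied sentences gives the sum more chances to hit such a combination, which is the motivation for the diversity augmentations.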

Abstract

Sentiment Analysis (SA) models harbor inherent social biases that can be harmful in real-world applications. These biases are identified by examining the output of SA models for sentences that only vary in the identity groups of the subjects. Constructing natural, linguistically rich, relevant, and diverse sets of sentences that provide sufficient coverage over the domain is expensive, especially when addressing a wide range of biases: it requires domain experts and/or crowd-sourcing. In this paper, we present a novel bias testing framework, BTC-SAM, which generates high-quality test cases for bias testing in SA models with minimal specification using Large Language Models (LLMs) for the controllable generation of test sentences. Our experiments show that relying on LLMs can provide high linguistic variation and diversity in the test sentences, thereby offering better test coverage compared to base prompting methods even for previously unseen biases.

Paper Structure

This paper contains 33 sections, 1 equation, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Overview of our BTC-SAM framework pipeline