The Effect of Human v/s Synthetic Test Data and Round-tripping on Assessment of Sentiment Analysis Systems for Bias

Kausik Lakkaraju; Aniket Gupta; Biplav Srivastava; Marco Valtorta; Dezhi Wu

TL;DR

This work addresses bias in text-based Sentiment Analysis Systems (SASs) by moving beyond synthetic English data to include two human-generated datasets (HD1, ALLURE, and HD2, Unibot) and round-trip translation through intermediate languages. It extends a causal rating framework that uses backdoor adjustment and metrics such as the Weighted Rejection Score (WRS) and Deconfounding Impact Estimation (DIE) to assess bias with respect to protected attributes. The findings show that SASs exhibit more bias on human-generated data than on synthetic data, and that round-tripping can reduce bias for human-generated data (HD) but increase it for synthetic data (SD), with effects varying across datasets and languages. These results guide the design of more realistic SAS testing strategies to improve trust and reliability in mission-critical, multilingual deployments.

Abstract

Sentiment Analysis Systems (SASs) are data-driven Artificial Intelligence (AI) systems that output polarity and emotional intensity when given a piece of text as input. Like other AI systems, SASs are known to behave unstably when their input data changes. This raises concerns such as bias and makes them hard to trust when AI works with humans and the data carries protected attributes like gender, race, and age. Recently, an approach was introduced to assess SASs in a black-box setting, without access to training data or code, and to rate them for bias using synthetic English data. We augment it by introducing two human-generated chatbot datasets, and we also consider a round-trip setting in which the data is translated from one language back into the same language through an intermediate language. We find that these settings show SAS performance in a more realistic light. Specifically, rating SASs on the chatbot data revealed more bias than rating them on the synthetic data, and round-tripping using Spanish and Danish as intermediate languages reduces the bias (up to 68% reduction) in human-generated data, while in synthetic data it takes a surprising turn and increases the bias. Our findings will help researchers and practitioners refine their SAS testing strategies and foster trust as SASs are considered for more mission-critical applications for global use.
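The round-trip setting described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `toy_to_pivot`, `toy_to_source`, and `toy_scorer` are hypothetical stubs standing in for a real translator and a real SAS, and `group_gap` is a plain difference of group means, not the paper's WRS or DIE metrics. The stubs are rigged so the example runs and shows the mechanics; they say nothing about how real systems behave.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (protected-attribute group, input text)


def round_trip(text: str,
               to_pivot: Callable[[str], str],
               to_source: Callable[[str], str]) -> str:
    """Translate into an intermediate language and back to the source language."""
    return to_source(to_pivot(text))


def group_gap(samples: List[Sample], scorer: Callable[[str], float]) -> float:
    """Stand-in bias measure: spread between per-group mean sentiment scores."""
    by_group: Dict[str, List[float]] = {}
    for group, text in samples:
        by_group.setdefault(group, []).append(scorer(text))
    means = [mean(scores) for scores in by_group.values()]
    return max(means) - min(means)


# --- Hypothetical stubs (assumptions, not real translators or SASs) ---

def toy_to_pivot(text: str) -> str:
    # Pretend the intermediate language has a single gender-neutral pronoun.
    return " ".join("PRON" if w in ("he", "she") else w for w in text.split())


def toy_to_source(text: str) -> str:
    # Translating back cannot recover the original pronoun.
    return " ".join("they" if w == "PRON" else w for w in text.split())


def toy_scorer(text: str) -> float:
    # A deliberately biased scorer: the pronoun "she" shifts the score.
    words = text.split()
    score = 0.5 if "happy" in words else 0.0
    if "she" in words:
        score += 0.25
    return score


samples: List[Sample] = [
    ("male", "he feels happy"),
    ("female", "she feels happy"),
    ("male", "he feels tired"),
    ("female", "she feels tired"),
]

# Score the original text, then the round-tripped text, and compare the gaps.
gap_original = group_gap(samples, toy_scorer)
round_tripped = [(g, round_trip(t, toy_to_pivot, toy_to_source)) for g, t in samples]
gap_round_trip = group_gap(round_tripped, toy_scorer)
```

In a real assessment the stubs would be replaced by an actual translation service and the SAS under test, and the gap would be replaced by the rating metrics the paper defines. Whether the round trip raises or lowers the measured bias is an empirical question, and the abstract reports that it differs between human-generated and synthetic data.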

Paper Structure (28 sections, 3 equations, 6 figures, 5 tables)

Introduction
Background
Problem
Notation
Formulation
Overview of AI Systems Rating
Causal Model
Data
Systems Evaluated
Rating Methodology
Performing Statistical Tests to Assess Causal Dependency
Assigning Final Ratings
Limitations
Data and Rating Methods
Data Used
...and 13 more sections

Figures (6)

Figure 1: Causal model for rating SASs
Figure 2: Snapshot of the preprocessed ALLURE dataset (HD1)
Figure 3: Snapshot of the preprocessed Unibot data (HD2)
Figure 4: Methodology for comparing bias scores on original and round-trip translated data.
Figure 5: Causal model for rating SASs on HD1.
...and 1 more figure
