Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models

Lance Calvin Lim Gamboa; Yue Feng; Mark Lee

TL;DR

FilBBQ is a culturally aware Filipino bias benchmark for question answering. It extends the BBQ framework through a four-phase adaptation, yielding 10,576 prompts drawn from 123 templates (52 original) that target sexist and homophobic biases in the Philippine context. To address model response instability, the authors run the prompts across 50 seeds and average the resulting bias scores. They demonstrate substantial seed-to-seed variability and provide bias profiles for three Filipino-capable models: the strongest sexist biases concern emotion and domesticity, the strongest homophobic biases are linked to polygamy, and some models show concerning associations (e.g., pedophilia) tied to non-heterosexuality. The work contributes a practical, culturally grounded bias test and a reproducible evaluation protocol that supports future multilingual bias research and mitigation efforts. FilBBQ is available on GitHub.

Abstract

With natural language generation becoming a popular use case for language models, the Bias Benchmark for Question-Answering (BBQ) has grown to be an important benchmark format for evaluating stereotypical associations exhibited by generative models. We expand the linguistic scope of BBQ and construct FilBBQ through a four-phase development process consisting of template categorization, culturally aware translation, new template construction, and prompt generation. These processes resulted in a bias test composed of more than 10,000 prompts that assess whether models demonstrate sexist and homophobic prejudices relevant to the Philippine context. We then apply FilBBQ to models trained in Filipino, using a robust evaluation protocol that improves upon the reliability and accuracy of previous BBQ implementations. Specifically, we account for models' response instability by obtaining prompt responses across multiple seeds and averaging the bias scores calculated from these distinctly seeded runs. Our results confirm both the variability of bias scores across different seeds and the presence of sexist and homophobic biases relating to emotion, domesticity, stereotyped queer interests, and polygamy. FilBBQ is available via GitHub.
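The seed-averaging idea described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: `bias_score` follows the disambiguated-context score from the original BBQ paper (2 × biased / non-unknown − 1), which may differ from the metrics FilBBQ actually defines, and `mock_model_answers` is a hypothetical stand-in that samples answer labels at random instead of querying a real model.

```python
import random
import statistics


def bias_score(answers):
    """BBQ-style score in [-1, 1] (assumed formula, not necessarily FilBBQ's).

    `answers` is a list of labels: "biased", "counter_biased", or "unknown".
    """
    non_unknown = [a for a in answers if a != "unknown"]
    if not non_unknown:
        return 0.0  # no committed answers, so no measurable bias direction
    biased = sum(1 for a in non_unknown if a == "biased")
    return 2 * biased / len(non_unknown) - 1


def mock_model_answers(seed, n_prompts=200):
    """Hypothetical stand-in for a seeded model run over the benchmark."""
    rng = random.Random(seed)
    return rng.choices(
        ["biased", "counter_biased", "unknown"],
        weights=[0.4, 0.3, 0.3],
        k=n_prompts,
    )


def seed_averaged_bias(seeds, answer_fn):
    """Score each seeded run separately, then average the per-seed scores."""
    scores = [bias_score(answer_fn(seed)) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores), scores


mean_score, spread, per_seed = seed_averaged_bias(range(50), mock_model_answers)
print(f"mean bias score over {len(per_seed)} seeds: {mean_score:.3f}")
print(f"standard deviation across seeds: {spread:.3f}")
```

Reporting the spread alongside the mean makes the seed-to-seed variability visible, which a single-seed score would hide.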

Paper Structure (23 sections, 2 equations, 2 figures, 5 tables)

Introduction
Related Work
Cross-Cultural Bias Benchmarks
Bias in Filipino Language Models
The Dataset
BBQ Format
Benchmark Adaptation
BBQ Template Categorization
Culturally Aware Translation
New Template Construction
Prompt Generation
Benchmark Statistics
Evaluation
Models
Bias Evaluation Metrics
...and 8 more sections

Figures (2)

Figure 1: Jitter plot showing variable bias scores across differently seeded runs. The plot’s points reflect scores for the FilBBQ prompt on the “Women are emotional” stereotype (ambiguous context version).
Figure 2: Jitter plot showing variable bias scores across differently seeded runs. The plot’s points reflect scores for the FilBBQ prompt on the “Gay people like fashion, design, and gossip” stereotype (ambiguous context version).
