ViLBias: Detecting and Reasoning about Bias in Multimodal Content

Shaina Raza; Caesar Saleh; Azib Farooq; Emrul Hasan; Franklin Ogidi; Maximus Powers; Veronica Chatrath; Marcelo Lotif; Karanpal Sekhon; Roya Javadi; Haad Zahid; Anam Zahid; Vahid Reza Khazaie; Zhenyu Yu

TL;DR

ViLBias introduces BiasCorpus, a multimodal bias benchmark of roughly 40k text-image pairs, each with a concise rationale produced by a hybrid pipeline that combines LLM-as-annotator labeling with human-in-the-loop (HITL) validation. It defines a VQA-style evaluation covering both closed-ended bias classification and open-ended reasoning (oVQA) across SLMs, LLMs, and VLMs, including tests of parameter-efficient fine-tuning. Results show that adding images to text improves detection accuracy by 3–5%, and that instruction tuning improves open-ended reasoning (accuracy of 52–79%, faithfulness of 68–89%), with a strong correlation (r = 0.91) between closed-ended classification accuracy and reasoning quality. ViLBias provides scalable baselines and a rigorous framework for multimodal bias detection, but it also has limitations in bias taxonomy, annotation subjectivity, and governance, which point to future work on broader cultural coverage, robustness, and responsible deployment.

Abstract

Detecting bias in multimodal news requires models that reason over text-image pairs, not just classify text. In response, we present ViLBias, a VQA-style benchmark and framework for detecting and reasoning about bias in multimodal news. The dataset comprises 40,945 text-image pairs from diverse outlets, each annotated with a bias label and concise rationale using a two-stage LLM-as-annotator pipeline with hierarchical majority voting and human-in-the-loop validation. We evaluate Small Language Models (SLMs), Large Language Models (LLMs), and Vision-Language Models (VLMs) across closed-ended classification and open-ended reasoning (oVQA), and compare parameter-efficient tuning strategies. Results show that incorporating images alongside text improves detection accuracy by 3–5%, and that LLMs/VLMs better capture subtle framing and text-image inconsistencies than SLMs. Parameter-efficient methods (LoRA/QLoRA/Adapters) recover 97–99% of full fine-tuning performance with <5% trainable parameters. For oVQA, reasoning accuracy spans 52–79% and faithfulness 68–89%, both improved by instruction tuning; closed accuracy correlates strongly with reasoning (r = 0.91). ViLBias offers a scalable benchmark and strong baselines for multimodal bias detection and rationale quality.
