ViLBias: Detecting and Reasoning about Bias in Multimodal Content

Shaina Raza, Caesar Saleh, Azib Farooq, Emrul Hasan, Franklin Ogidi, Maximus Powers, Veronica Chatrath, Marcelo Lotif, Karanpal Sekhon, Roya Javadi, Haad Zahid, Anam Zahid, Vahid Reza Khazaie, Zhenyu Yu

TL;DR

ViLBias introduces BiasCorpus, a multimodal bias benchmark of about 40k text-image pairs, each with a concise rationale produced by a hybrid pipeline of LLM-as-annotator labeling and human-in-the-loop (HITL) validation. It defines a VQA-style evaluation covering both closed-ended bias classification and open-ended reasoning (oVQA) across SLMs, LLMs, and VLMs, including tests of parameter-efficient fine-tuning. Results show that incorporating images improves detection accuracy by 3–5%. On oVQA, reasoning accuracy ranges from 52–79% and faithfulness from 68–89%, both improved by instruction tuning, and closed-ended accuracy correlates strongly with reasoning quality (r=0.91). While ViLBias provides scalable baselines and a rigorous framework for multimodal bias detection, it also highlights limitations in bias taxonomy, annotation subjectivity, and governance, pointing to future work on broader cultural coverage, robustness, and responsible deployment.

Abstract

Detecting bias in multimodal news requires models that reason over text–image pairs, not just classify text. In response, we present ViLBias, a VQA-style benchmark and framework for detecting and reasoning about bias in multimodal news. The dataset comprises 40,945 text–image pairs from diverse outlets, each annotated with a bias label and concise rationale using a two-stage LLM-as-annotator pipeline with hierarchical majority voting and human-in-the-loop validation. We evaluate Small Language Models (SLMs), Large Language Models (LLMs), and Vision–Language Models (VLMs) across closed-ended classification and open-ended reasoning (oVQA), and compare parameter-efficient tuning strategies. Results show that incorporating images alongside text improves detection accuracy by 3–5%, and that LLMs/VLMs better capture subtle framing and text–image inconsistencies than SLMs. Parameter-efficient methods (LoRA/QLoRA/Adapters) recover 97–99% of full fine-tuning performance with <5% trainable parameters. For oVQA, reasoning accuracy spans 52–79% and faithfulness 68–89%, both improved by instruction tuning; closed accuracy correlates strongly with reasoning (r = 0.91). ViLBias offers a scalable benchmark and strong baselines for multimodal bias detection and rationale quality.

Paper Structure

This paper contains 26 sections, 2 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Examples from our dataset with rationales supporting the biased label.
  • Figure 2: Sample headlines and their associated images. Each image corresponds to a specific news article, illustrating the multimodal structure of the dataset.
  • Figure 3: ViLBias Framework. The pipeline comprises three main stages: (1) Data Collection and Preprocessing, (2) Automated and Human-Annotated Labeling, and (3) Final Evaluation.
  • Figure 4: LLM-As-Annotators with Human-In-The-Loop (HITL) Bias Annotation Framework. Each LLM is queried three times, and its responses are majority-voted to remove stochasticity. These majority-voted labels are taken from three LLMs and then majority-voted (across LLMs) again to produce a final label, which is reviewed, along with the LLM reasoning, by human annotators.
  • Figure 5: Top 10 outlets
  • ...and 7 more figures
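
The two-level voting described in the Figure 4 caption can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the annotator names (`llm_a`, `llm_b`, `llm_c`), the label strings (`biased`, `unbiased`), and the function names are assumptions made for the example, and the human review step is not modeled.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label in a non-empty list.

    With three binary votes there is always a strict majority; in other
    settings a tie resolves to the label seen first.
    """
    return Counter(labels).most_common(1)[0][0]


def hierarchical_vote(runs_per_llm):
    """Two-level vote: within each LLM's repeated runs, then across LLMs.

    runs_per_llm maps a (hypothetical) annotator name to the labels it
    produced over repeated queries for a single text-image pair.
    """
    # Level 1: collapse each LLM's repeated runs into one label.
    per_llm = {name: majority_vote(runs) for name, runs in runs_per_llm.items()}
    # Level 2: vote across the per-LLM labels to get the final label.
    final = majority_vote(list(per_llm.values()))
    return final, per_llm


# Hypothetical annotations for one item: three LLMs, three runs each.
example = {
    "llm_a": ["biased", "biased", "unbiased"],
    "llm_b": ["unbiased", "unbiased", "biased"],
    "llm_c": ["biased", "biased", "biased"],
}
final_label, per_llm_labels = hierarchical_vote(example)
print(final_label, per_llm_labels)
```

Returning the per-LLM labels alongside the final label keeps disagreement visible, which is the kind of signal a human reviewer could use to prioritize items.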