
Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs

Yiran Zhao, Lu Zhou, Xiaogang Xu, Zhe Liu, Jiafei Wu, Liming Fang

TL;DR

This paper introduces the IRIS Benchmark, to the authors' knowledge the first benchmark designed to synchronously evaluate the fairness of both understanding and generation tasks in UMLLMs. It can integrate evolving fairness metrics, helping to resolve the "Tower of Babel" impasse.

Abstract

As artificial intelligence (AI) is increasingly deployed across domains, ensuring fairness has become a core challenge. However, the field faces a "Tower of Babel" dilemma: fairness metrics abound, yet their underlying philosophical assumptions often conflict, hindering unified paradigms, particularly in unified Multimodal Large Language Models (UMLLMs), where biases propagate systemically across tasks. To address this, we introduce the IRIS Benchmark, to our knowledge the first benchmark designed to synchronously evaluate the fairness of both understanding and generation tasks in UMLLMs. Enabled by our demographic classifier, ARES, and four supporting large-scale datasets, the benchmark is designed to normalize and aggregate arbitrary metrics into a high-dimensional "fairness space", integrating 60 granular metrics across three dimensions: Ideal Fairness, Real-world Fidelity, and Bias Inertia & Steerability (IRIS). Through this benchmark, our evaluation of leading UMLLMs uncovers systemic phenomena such as the "generation gap", individual inconsistencies like "personality splits", and the "counter-stereotype reward", while offering diagnostics to guide the optimization of their fairness capabilities. With its novel and extensible framework, the IRIS benchmark is capable of integrating evolving fairness metrics, ultimately helping to resolve the "Tower of Babel" impasse. Project Page: https://iris-benchmark-web.vercel.app/

Paper Structure (124 sections, 3 equations, 22 figures, 37 tables, 1 algorithm)

This paper contains 124 sections, 3 equations, 22 figures, 37 tables, 1 algorithm.

Figures (22)

  • Figure 1: Conceptual illustration of the IRIS benchmark. This diagram shows the overall workflow, starting from (a) the models being tested (\ref{sec:appendix_model list}). The evaluation is enabled by (b) our specialized toolkit, including the ARES classifier and four datasets (\ref{sec:tools}). (c) Raw, granular fairness metrics are calculated (\ref{sec:pipeline}) and (d) projected into a high-dimensional "fairness space" where distance from the origin (the IRIS Fairness Singularity) quantifies bias (\ref{sec:theory}). (e) This space is structured by the IRIS benchmark's three core dimensions (Ideal Fairness, Real-world Fidelity, Bias Inertia & Steerability) applied across two tasks (Generation, Understanding), producing six interpretable evaluation sectors (\ref{sec:metrics}). (f) The final output includes quantitative scores and a qualitative "personality" profile, providing a holistic diagnosis of the model's fairness characteristics (\ref{sec:exp}, \ref{sec:analysis}).
  • Figure 2: Schematic of the IRIS benchmark evaluation pipeline, illustrating the dual-task and three-dimensional assessment, the scoring flow, and the final projection into the high-dimensional "fairness space". [Mi] refers to the specific metrics listed in \ref{tab:metrics_framework}; * denotes detailed raw data processing rules provided in \ref{ssec:protocols}; † indicates the aggregation procedure described in \ref{sec:theory} and \ref{Appendix:mechanism}; ‡ refers to the real-world data used to calculate the RFS score, with specifications reported in \ref{ssec:real_world_data}.
  • Figure 3: Schematic diagram of the ARES classifier. Specific model information, training data, details, and adaptive routing rules can be found in \ref{sec:appendix_ares}.
  • Figure 4: Validation of the IRIS benchmark design, confirming its (a) reliability via internal consistency, (b) robustness to parameter changes (parameters varied by 10%), (c) validity through dimensional correlation analysis, and (d) impartiality across model architectures. *Acceptable threshold ($\alpha=0.7$).
  • Figure 5: Schematic diagram of the experimental process for exploring the internal mechanisms of the BLIP3-o and Harmon models. Gray arrows represent the flow of data in generation; black arrows represent the flow of data in understanding. Detailed settings and results of the mechanistic probe experiments are given in \ref{ssec:mechanistic_protocols}.
  • ...and 17 more figures
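
The caption of Figure 1 describes projecting normalized metrics into a "fairness space" in which distance from the origin quantifies bias. The snippet below is a minimal sketch of that idea only. It assumes min-max normalization and an unweighted, dimension-scaled Euclidean distance, neither of which is confirmed by this summary; the paper's own aggregation procedure is the one referenced in its theory section. The metric names, raw values, and the helper functions `normalize` and `bias_distance` are hypothetical.

```python
import math


def normalize(value, worst, best=0.0):
    """Map a raw metric onto [0, 1], where 0 means no measured bias.

    Assumption: min-max scaling between a 'best' and a 'worst' reference value.
    """
    span = worst - best
    if span == 0:
        raise ValueError("worst and best must differ")
    scaled = (value - best) / span
    # Clip so that out-of-range raw values cannot leave the unit interval.
    return min(1.0, max(0.0, scaled))


def bias_distance(normalized):
    """Euclidean distance from the origin, divided by sqrt(n) to stay in [0, 1].

    Assumption: all metrics are weighted equally.
    """
    if not normalized:
        raise ValueError("need at least one metric")
    return math.sqrt(sum(x * x for x in normalized) / len(normalized))


# Hypothetical raw metrics as (observed value, worst-case value) pairs.
raw = {
    "metric_a": (0.3, 1.0),
    "metric_b": (2.0, 4.0),
    "metric_c": (0.0, 10.0),
}

point = [normalize(value, worst) for value, worst in raw.values()]
score = bias_distance(point)
print(f"fairness-space point: {point}, distance from origin: {score:.4f}")
```

Under these assumptions, a model with no measured bias on any metric sits at the origin with distance 0, and a model at the worst case on every metric has distance 1.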