Physics-based phenomenological characterization of cross-modal bias in multimodal models

Hyeongmo Kim; Sohyun Kang; Yerin Choi; Seungyeon Ji; Junhyuk Woo; Hyunsuk Chung; Soyeon Caren Han; Kyungreem Han

TL;DR

We develop a physics-based surrogate model of transformer dynamics to analyze cross-modal bias in MLLMs, whose dynamics are not fully captured by conventional embedding- or representation-level analyses.

Abstract

The term 'algorithmic fairness' is used to evaluate whether AI models operate fairly in both comparative contexts (where fairness is understood as formal equality, such as "treat like cases as like") and non-comparative contexts (where unfairness arises from the model's inaccuracy, arbitrariness, or inscrutability). Recent advances in multimodal large language models (MLLMs) are breaking new ground in multimodal understanding, reasoning, and generation; however, we argue that inconspicuous distortions arising from complex multimodal interaction dynamics can lead to systematic bias. The purpose of this position paper is twofold: first, to acquaint AI researchers with phenomenological explainable approaches that rely on the physical entities that the machine experiences during training and inference, as opposed to the traditional cognitivist symbolic account or metaphysical approaches; second, to argue that this phenomenological doctrine is practically useful for tackling algorithmic fairness issues in MLLMs. We develop a surrogate physics-based model that describes transformer dynamics (i.e., semantic network structure and self-/cross-attention) to analyze the dynamics of cross-modal bias in MLLMs, which are not fully captured by conventional embedding- or representation-level analyses. We support this position through multi-input diagnostic experiments: 1) perturbation-based analyses of emotion classification using Qwen2.5-Omni and Gemma 3n, and 2) dynamical analysis of Lorenz chaotic time-series prediction through the physical surrogate. Across two architecturally distinct MLLMs, we show that multimodal inputs can reinforce modality dominance rather than mitigate it, as revealed by structured error-attractor patterns under systematic label perturbation and complemented by the dynamical analysis.

Paper Structure (10 sections, 6 equations, 5 figures, 1 table)

Background and Introduction
Diagnostic Analysis of Multimodal Large Language Models
Dynamical Analysis of Multimodal Interaction Using a Physical Surrogate Model
Conclusions and Remarks
Acknowledgments

Figures (5)

Figure 1: Error-attractor structures under perturbations of emotion labels in multimodal large language models. Directed graphs visualize incorrect emotion classifications on the CREMA-D dataset under three input conditions: Face (Video) + Voice (Audio), Face (Video) only, and Voice (Audio) only. Nodes represent six emotion labels (happy, neutral, sad, angry, disgust, fear), and directed edges depict erroneous mappings from intended labels to predicted labels.
Figure 2: Sankey diagram displaying emotion predictions in Qwen2.5-Omni. The width of each flow indicates the number of samples assigned to each mapping: 1) from the intended label (left) to the model prediction (center), and 2) from the perceived label (right) to the model prediction (center).
Figure 3: Illustration of the prompt-based label perturbation strategy.
Figure 4: Schematic diagram of Lorenz chaotic time-series prediction using the multi-oscillator model with self- and cross-attention mechanisms.
Figure 5: Characterization of the transformer dynamics using a physical testbed: Lorenz chaotic time-series prediction on a multi-oscillator system. (a) Modality preference is quantified by the difference in dynamical SHAP values, $\phi(Y)-\phi(X)$, across the self- and cross-attention levels $(\beta_\text{self},\beta_\text{cross})$. The SHAP difference is represented by the direction of the arrow in the range $[-90^\circ,90^\circ]$: $0^\circ$ signifies equal contributions from $X$ and $Y$, $-90^\circ$ indicates an $X$-only contribution, and $90^\circ$ a $Y$-only contribution. The arrow color represents the normalized mean squared error (NMSE) between the target $z(t)$ and the prediction. Predictions are visualized in the embedding space for two representative cases: (b) low self- and cross-attention levels, i.e., $(\beta_\text{self},\beta_\text{cross})= (10^{-4},10^{-4})$, and (c) high levels, i.e., $(\beta_\text{self},\beta_\text{cross})= (10^{0},10^{0})$. For clarity, only the time series for $50\leq t \leq70$ are displayed.
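
As a rough illustration of the testbed behind Figures 4 and 5, the sketch below generates a Lorenz trajectory and scores a prediction of $z(t)$ with an NMSE over the window $50\leq t \leq70$. It is not the authors' multi-oscillator model. The classic Lorenz parameters ($\sigma=10$, $\rho=28$, $\beta=8/3$), the step size, the initial state, the RK4 integrator, and the variance-normalized definition of NMSE are all assumptions made here, since this page does not specify them. The names `lorenz_rhs`, `rk4_step`, `simulate`, and `nmse` are hypothetical helpers.

```python
# Minimal sketch: Lorenz time series plus an NMSE score for a z(t) prediction.
# All parameter values and the NMSE normalization are assumptions, not taken
# from the paper.

def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (classic parameters assumed)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """Advance the state by one fourth-order Runge-Kutta step."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(shifted(state, k1, dt / 2))
    k3 = lorenz_rhs(shifted(state, k2, dt / 2))
    k4 = lorenz_rhs(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(t_end=70.0, dt=0.01, state=(1.0, 1.0, 1.0)):
    """Return the trajectory [(x, y, z), ...] sampled every dt up to t_end."""
    n_steps = int(round(t_end / dt))
    trajectory = [state]
    for _ in range(n_steps):
        state = rk4_step(state, dt)
        trajectory.append(state)
    return trajectory


def nmse(target, prediction):
    """Mean squared error divided by the variance of the target (assumed form)."""
    n = len(target)
    mean_t = sum(target) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    mse = sum((t - p) ** 2 for t, p in zip(target, prediction)) / n
    return mse / var_t


trajectory = simulate()
window = trajectory[5000:7001]          # samples with 50 <= t <= 70 at dt = 0.01
z_target = [s[2] for s in window]

# Trivial baseline: predict the window mean of z at every time step.
mean_z = sum(z_target) / len(z_target)
baseline = [mean_z] * len(z_target)
print("NMSE of mean baseline:", nmse(z_target, baseline))
```

Under this normalization a constant mean predictor scores an NMSE of 1 by construction, so values well below 1 indicate a prediction that tracks $z(t)$ better than that trivial baseline.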
