On the Reliability of Cue Conflict and Beyond

Pum Jun Kim; Seung-Ah Lee; Seongho Park; Dongyoon Han; Jaejun Yoo

Abstract

Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may neither reliably instantiate perceptually valid and separable cues nor control their relative informativeness; ratio-based bias can obscure absolute cue sensitivity; and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model-recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric. Across diverse training regimes and architectures, REFINED-BIAS enables fairer cross-model comparison, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.
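Two of the concerns raised in the abstract can be illustrated with toy numbers. The sketch below is not the paper's code or metric. It assumes the commonly used ratio definition of shape bias (shape decisions divided by shape plus texture decisions), and it uses a plain argmax over a subset of logits as a simplified stand-in for restricting evaluation to preselected classes. The function names, counts, and logits are all hypothetical.

```python
import numpy as np


def ratio_shape_bias(shape_hits: int, texture_hits: int) -> float:
    """Ratio-based shape bias: shape decisions / (shape + texture decisions)."""
    total = shape_hits + texture_hits
    if total == 0:
        raise ValueError("no shape or texture decisions to compare")
    return shape_hits / total


def predict(logits, allowed=None) -> int:
    """Return the top-1 class, optionally restricted to an allowed subset."""
    logits = np.asarray(logits, dtype=float)
    if allowed is None:
        return int(np.argmax(logits))
    allowed = np.asarray(allowed)
    # Pick the best class among the allowed ones only.
    return int(allowed[np.argmax(logits[allowed])])


# Hypothetical counts out of 1000 cue-conflict images.
# Model A responds to either cue on 800 images, model B on only 80,
# yet both receive the same ratio-based shape bias.
bias_a = ratio_shape_bias(shape_hits=600, texture_hits=200)
bias_b = ratio_shape_bias(shape_hits=60, texture_hits=20)

# Hypothetical logits over five classes; class 3 wins over the full space.
logits = [0.1, 2.0, 0.3, 5.0, 0.2]
full_pred = predict(logits)
# Restricting to classes {0, 1, 2} forces a different answer.
restricted_pred = predict(logits, allowed=[0, 1, 2])
```

In this toy setting the two models are indistinguishable under the ratio even though their absolute cue sensitivity differs by a factor of ten, and the restricted prediction reports a class the model did not actually prefer.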

Paper Structure (35 sections, 4 equations, 18 figures, 7 tables)

Introduction
Revisiting Cue Preference Benchmarking
The Cue-conflict Paradigm
Limitations of the Current Cue-conflict Instantiation
Problem 1. Stylization Undermines Cue Reliability
Problem 2. Relative Bias Obscures Cue Sensitivity
Problem 3. Restricted Label Evaluation Distorts Model Predictions
Prior Critiques and the Remaining Gaps
Methodology
Shape and Texture Cue Construction
Model Comparisons with Redefined Bias
Experiments
Validating REFINED-BIAS Benchmark
What Becomes Visible Once Bias Is Measured Reliably
Discussion
...and 20 more sections

Figures (18)

Figure 1: Empirical instability of stylized image-based bias evaluation. (a) illustrates the core insight of cue-conflict that stronger shape bias, similar to humans, improves in-domain performance. (b) shows examples of unrecognizable cues in the cue-conflict dataset. (c) illustrates conflicting findings on this core insight of cue-conflict.
Figure 2: Limitations of the cue-conflict benchmark: Although it has offered a valuable and well-designed framework for studying shape and texture biases, we argue that several limitations warrant further attention: (a) cue entanglement caused by stylization, where shape and texture information leak into each other, (b) imbalanced cue information caused by stylization, leading to unfair predictive contributions, (c) ignoring differences in cue sensitivity, which prevents distinguishing models with genuine biases, and (d) evaluation restricted to preselected classes.
Figure 3: Examples of imperfect cue separation in the cue-conflict dataset. (a) Qualitative examples of ambiguously extracted shape and texture cues. (b) Kendall’s rank correlation of class-wise model top-1 accuracy on stylized cues and pure shape stimuli. All ImageNet-1k pretrained CNN and ViT models listed in Appendix \ref{app_sec:model_across_structure} are utilized.
Figure 4: Unequal recognizability of cues in the cue-conflict dataset. (a) Qualitative examples of unequal informativeness, and (b) human and model perception trends, on shape and texture cues. Model top-1 accuracies are shown with bars indicating 95% confidence intervals. All ImageNet-1k pretrained CNN and ViT models listed in Appendix \ref{app_sec:model_across_structure} are utilized. See §\ref{sec:shape_texture_stimuli} for human studies.
Figure 5: Illustration of the difference between (a) true model prediction and (b) distorted model prediction. See Appendix \ref{app:false_positive} for more details.
...and 13 more figures
