On the Reliability of Cue Conflict and Beyond

Pum Jun Kim, Seung-Ah Lee, Seongho Park, Dongyoon Han, Jaejun Yoo

Abstract

Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may fail to instantiate perceptually valid and separable cues or to control their relative informativeness; ratio-based bias can obscure absolute cue sensitivity; and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model-recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric. Across diverse training regimes and architectures, REFINED-BIAS enables fairer cross-model comparison, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.

Paper Structure (35 sections, 4 equations, 18 figures, 7 tables)

Figures (18)

  • Figure 1: Empirical instability of stylized image-based bias evaluation. (a) illustrates the core insight of cue-conflict that stronger shape bias, similar to humans, improves in-domain performance. (b) shows examples of unrecognizable cues in the cue-conflict dataset. (c) illustrates conflicting findings on this core insight of cue-conflict.
  • Figure 2: Limitations of the cue-conflict benchmark: Although it has offered a valuable and well-designed framework for studying shape and texture biases, we argue that several limitations warrant further attention: (a) cue entanglement caused by stylization, where shape and texture information leak into each other, (b) imbalanced cue information caused by stylization, leading to unfair predictive contributions, (c) ignoring differences in cue sensitivity, which prevents distinguishing models with genuine biases, and (d) evaluation restricted to preselected classes.
  • Figure 3: Examples of imperfect cue separation in the cue-conflict dataset. (a) Qualitative examples of ambiguously extracted shape and texture cues. (b) Kendall’s rank correlation of class-wise model top-1 accuracy on stylized cues and pure shape stimuli. All ImageNet-1k pretrained CNN and ViT models listed in Appendix \ref{app_sec:model_across_structure} are utilized.
  • Figure 4: Unequal recognizability of cues in the cue-conflict dataset. (a) Qualitative examples of unequal informativeness, and (b) human and model perception trends, on shape and texture cues. Model top-1 accuracies are shown with bars indicating 95% confidence intervals. All ImageNet-1k pretrained CNN and ViT models listed in Appendix \ref{app_sec:model_across_structure} are utilized. See §\ref{sec:shape_texture_stimuli} for human studies.
  • Figure 5: Illustration of the difference between (a) true model prediction and (b) distorted model prediction. See Appendix \ref{app:false_positive} for more details.
  • ...and 13 more figures
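
Two of the limitations named in the captions above can be illustrated with a toy sketch: a ratio-based bias score that hides absolute cue sensitivity (Figure 2c), and a prediction that changes once the label space is restricted to preselected classes (Figure 2d, Figure 5). This is a minimal illustration, not the paper's metric or evaluation code. The ratio uses the common cue-conflict convention (shape-consistent decisions divided by shape- plus texture-consistent decisions), and every count, score, class name, and function name below is a hypothetical chosen for the example.

```python
def ratio_shape_bias(shape_hits, texture_hits):
    """Fraction of cue-consistent decisions that follow the shape cue."""
    total = shape_hits + texture_hits
    if total == 0:
        raise ValueError("no cue-consistent decisions to compute a ratio from")
    return shape_hits / total


def argmax_label(scores, allowed=None):
    """Return the highest-scoring label, optionally restricted to `allowed`."""
    if allowed is None:
        candidates = scores
    else:
        candidates = {k: v for k, v in scores.items() if k in allowed}
    return max(candidates, key=candidates.get)


# Hypothetical counts out of 1000 cue-conflict images per model.
# Model A rarely recognizes either cue; model B recognizes one in every image.
bias_a = ratio_shape_bias(shape_hits=40, texture_hits=10)
bias_b = ratio_shape_bias(shape_hits=800, texture_hits=200)
sensitivity_a = (40 + 10) / 1000   # share of images where any cue was recognized
sensitivity_b = (800 + 200) / 1000

# Hypothetical class scores for one image over a three-class "full" label space.
scores = {"jigsaw puzzle": 0.55, "cat": 0.25, "elephant": 0.20}
preselected = {"cat", "elephant"}  # hypothetical evaluation subset

full_prediction = argmax_label(scores)
restricted_prediction = argmax_label(scores, allowed=preselected)

print(bias_a, bias_b)                 # identical ratios
print(sensitivity_a, sensitivity_b)   # very different absolute sensitivity
print(full_prediction, restricted_prediction)
```

In this toy setting both models receive the same ratio-based bias even though model A recognizes a cue in only 5% of images, and the restricted evaluation reports "cat" for an image the model actually assigns to a class outside the preselected set.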