Table of Contents
Fetching ...

On the dynamic evolution of CLIP texture-shape bias and its relationship to human alignment and model robustness

Pablo Hernández-Cámara, Jose Manuel Jaén-Lorites, Alexandra Gómez-Villa, Jorge Vila-Tomás, Valero Laparra, Jesus Malo

TL;DR

An epoch-by-epoch analysis of CLIP models throughout training reveals a systematic trade-off between early low-level perceptual alignment and later robustness, offering new insights into the representational dynamics of vision-language models and their relationship to human perception.

Abstract

Contrastive language-image models such as CLIP have demonstrated remarkable generalization capabilities. However, how their internal visual representations evolve during training and how this evolution relates to human perception remains poorly understood. Most existing analysis characterize fully trained models, leaving the dynamics of representational biases and perceptual alignment largely unexplored. In this work, we present an epoch-by-epoch analysis of CLIP models throughout training, focusing on the evolution of texture-shape bias, alignment with human perceptual judgements, and sensitivity to image noise. Using multiple perceptual benchmarks spanning low-level image quality assessment, mid-level perceptual similarity, saliency correspondence, and noisy robustness, we identify a consistent, training-stage-dependent representational transition. Early training stages exhibit strong texture bias, elevated alignment with low-level human perceptual measures, and increased sensitivity to Gaussian noise perturbations. As training progresses, this texture bias gradually diminishes in favor of more shape-based representations, coinciding with improved robustness to noise and a decline in low-level perceptual alignment. Importantly, these dynamics are consistently observed across multiple CLIP model scales, indicating that the phenomenon is not specific to a particular architecture size. Our findings provide an empirical characterization of how perceptual alignment, feature bias, and robustness co-evolve during multimodal model training. This work reveals a systematic trade-off between early low-level perceptual alignment and later robustness, offering new insights into the representational dynamics of vision-language models and their relationship to human visual processing.

On the dynamic evolution of CLIP texture-shape bias and its relationship to human alignment and model robustness

TL;DR

An epoch-by-epoch analysis of CLIP models throughout training reveals a systematic trade-off between early low-level perceptual alignment and later robustness, offering new insights into the representational dynamics of vision-language models and their relationship to human perception.

Abstract

Contrastive language-image models such as CLIP have demonstrated remarkable generalization capabilities. However, how their internal visual representations evolve during training and how this evolution relates to human perception remains poorly understood. Most existing analysis characterize fully trained models, leaving the dynamics of representational biases and perceptual alignment largely unexplored. In this work, we present an epoch-by-epoch analysis of CLIP models throughout training, focusing on the evolution of texture-shape bias, alignment with human perceptual judgements, and sensitivity to image noise. Using multiple perceptual benchmarks spanning low-level image quality assessment, mid-level perceptual similarity, saliency correspondence, and noisy robustness, we identify a consistent, training-stage-dependent representational transition. Early training stages exhibit strong texture bias, elevated alignment with low-level human perceptual measures, and increased sensitivity to Gaussian noise perturbations. As training progresses, this texture bias gradually diminishes in favor of more shape-based representations, coinciding with improved robustness to noise and a decline in low-level perceptual alignment. Importantly, these dynamics are consistently observed across multiple CLIP model scales, indicating that the phenomenon is not specific to a particular architecture size. Our findings provide an empirical characterization of how perceptual alignment, feature bias, and robustness co-evolve during multimodal model training. This work reveals a systematic trade-off between early low-level perceptual alignment and later robustness, offering new insights into the representational dynamics of vision-language models and their relationship to human visual processing.

Paper Structure

This paper contains 16 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Left: ImageNet-1k accuracy for CLIP-base16 during the training. Right: CLIP-base16 texture bias based on cue-conflict images.
  • Figure 2: Evolution of CLIP-base16 on four different datasets during its training. From left to right, it shows: 1. Noise sensitivity: Measured as the normalized accuracy drop on ImageNet-1k-C gaussian noise corrupted images. 2. Low-level alignment: Mean model correlation with human mean opinion score on both TID2013 and KADID-10K. 3. Mid-level alignment: Model accuracy with human preferences on the NIGHTS perceptual database. 4. Saliency alignment: Model correlation with human saliency maps from .
  • Figure 3: Evolution of the analyzed metrics for three different CLIP scales (base, large, huge).