DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction

Jonathan Skaza, Parsa Madinei, Ziqi Wen, Miguel Eckstein

TL;DR

This paper tackles the problem of predicting human-perceived image complexity using a vision-only approach. It introduces DReX, a compact fusion of self-supervised DINOv3 CLS tokens and multi-scale ResNet-50 features through a lightweight attention mechanism, with frozen backbones and a small regression head. On the IC9600 benchmark, DReX achieves $r=0.9581$ and $\rho=0.9542$, while demonstrating strong generalization to SAVOIAS and PASCAL VOC and using far fewer trainable parameters than multimodal competitors. The results imply that high-quality visual representations alone can align with human complexity judgments, offering a parameter-efficient template for other subjective perceptual tasks.
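The four evaluation metrics reported for DReX (Pearson $r$, Spearman $\rho$, RMSE, MAE) can be sketched in plain Python as below. This is a minimal illustration, not the authors' evaluation code: the function names and the toy score lists are assumptions, and ties in the Spearman ranking are handled with average ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse(pred, target):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def mae(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Toy example (hypothetical scores, not data from the paper).
human_scores = [0.10, 0.35, 0.50, 0.80]
model_scores = [0.15, 0.30, 0.55, 0.75]
print(pearson(model_scores, human_scores), spearman(model_scores, human_scores))
print(rmse(model_scores, human_scores), mae(model_scores, human_scores))
```

Pearson $r$ measures linear agreement with the human scores, while Spearman $\rho$ only checks that images are ordered the same way, which is why the two are usually reported together.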

Abstract

Visual complexity prediction is a fundamental problem in computer vision with applications in image compression, retrieval, and classification. Understanding what makes humans perceive an image as complex is also a long-standing question in cognitive science. Recent approaches have leveraged multimodal models that combine visual and linguistic representations, but it remains unclear whether language information is necessary for this task. We propose DReX (DINO-ResNet Fusion), a vision-only model that fuses self-supervised and convolutional representations through a learnable attention mechanism to predict image complexity. Our architecture integrates multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, enabling the model to capture both low-level texture patterns and high-level semantic structure. DReX achieves state-of-the-art performance on the IC9600 benchmark (Pearson r = 0.9581), surpassing previous methods, including those trained on multimodal image-text data, while using approximately 21.5x fewer learnable parameters. Furthermore, DReX generalizes robustly across multiple datasets and metrics, achieving superior results on Pearson and Spearman correlation, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Ablation and attention analyses confirm that DReX leverages complementary cues from both backbones, with the DINOv3 [CLS] token enhancing sensitivity to visual complexity. Our findings suggest that visual features alone can be sufficient for human-aligned complexity prediction and that, when properly fused, self-supervised transformers and supervised deep convolutional neural networks offer complementary and synergistic benefits for this task.
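The fusion idea described in the abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: each frozen representation is assumed to arrive as a pooled vector, is projected to a shared width, receives one scalar attention score, and the softmax-weighted sum is passed to a small MLP. The projection layers, `shared_dim`, `hidden_dim`, and the random weights standing in for learned parameters are all hypothetical. The input widths are the standard ones for these backbones (256, 512, 1024, 2048 channels for the four ResNet-50 stages and 384 for a ViT-S [CLS] token).

```python
import math
import random


def linear(x, weight, bias):
    """Compute W x + b for one vector x; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_linear(rng, in_dim, out_dim):
    """Random weights as a stand-in for trained parameters (assumption)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def fuse_and_predict(features, params):
    """Attention-weighted fusion of several feature vectors, then an MLP."""
    # 1. Project every representation into a shared space.
    projected = [linear(f, *params["proj"][i]) for i, f in enumerate(features)]
    # 2. One scalar score per representation, normalized with softmax.
    scores = [linear(p, *params["score"])[0] for p in projected]
    weights = softmax(scores)
    # 3. Weighted sum of the projected representations.
    fused = [sum(w * p[d] for w, p in zip(weights, projected))
             for d in range(len(projected[0]))]
    # 4. Small MLP head: ReLU hidden layer, then a scalar complexity score.
    hidden = [max(0.0, h) for h in linear(fused, *params["fc1"])]
    return linear(hidden, *params["fc2"])[0], weights


rng = random.Random(0)
feature_dims = [256, 512, 1024, 2048, 384]  # ResNet-50 stages 1-4, DINOv3 [CLS]
shared_dim, hidden_dim = 16, 8              # hypothetical sizes
params = {
    "proj": [init_linear(rng, d, shared_dim) for d in feature_dims],
    "score": init_linear(rng, shared_dim, 1),
    "fc1": init_linear(rng, shared_dim, hidden_dim),
    "fc2": init_linear(rng, hidden_dim, 1),
}
# Random vectors stand in for precomputed, frozen backbone features.
features = [[rng.random() for _ in range(d)] for d in feature_dims]
score, weights = fuse_and_predict(features, params)
print("attention weights:", [round(w, 3) for w in weights])
```

In this sketch the last entry of `weights` plays the role of the attention weight placed on the DINOv3 [CLS] token. Only the projection, scoring, and MLP parameters would be trained; the backbones that produce `features` stay frozen.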

Paper Structure

This paper contains 25 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: DReX architecture for visual complexity prediction. Multi-scale ResNet features and DINOv3's [CLS] embedding are adaptively fused via learned attention. ResNet blocks 1–4 provide hierarchical spatial information at different scales, while DINOv3's [CLS] token captures global semantic content. The attention fusion module learns to weight each representation, and an MLP head regresses the complexity score. Snowflakes indicate frozen pretrained weights; flames indicate trainable components ($\sim$1.8M parameters).
  • Figure 2: Ablation experiments on the IC9600 test set. (a) Change in correlation when ablating the entire DINO branch from DReX. (b) Change in correlation when ablating the entire ResNet branch from DReX. (c) Change in Pearson correlation ($r$) when ablating individual DINO embedding dimensions. Green lines indicate FDR-corrected significance at $p<0.01$. (d) Change in Pearson correlation ($r$) when ablating individual ResNet feature dimensions. Green bars indicate FDR-corrected significance at $p<0.01$. $^{***}$ denotes permutation test significance at $p<0.001$.
  • Figure 3: Exploration of the DINOv3 [CLS] token in DReX. (a) Pearson correlation between the attention weight, $w_d$, placed on the DINOv3 [CLS] token and the ground truth complexity score for all images in the IC9600 test set. Marginal histograms are included for each variable. (b) Dimension-level importance of the DINOv3 [CLS] token. The distribution of importance scores is positively skewed ($p < 0.001$).
  • Figure : Kernel density estimation of Pearson correlation ($r$) and Spearman's rank correlation ($\rho$) across 10 different random seeds for the models DReX, ICNet, D2S-R50, and D2S-R18.
  • Figure : Comparison of training time between DReX and D2S-R50. Using DReX, we precompute ResNet and DINOv3 features once and reuse them across all training epochs, which significantly reduces the time per epoch. The initial precomputation takes approximately 110 s for the IC9600 training set, after which each epoch takes only $\sim$1.2 s. In contrast, each epoch of D2S-R50 requires $\sim$130.5 s. These estimates were obtained on a workstation featuring an AMD Ryzen Threadripper 7970X 32-core processor and an NVIDIA RTX Pro 6000 Blackwell Workstation GPU.
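
The timing figures in the training-time caption above imply a simple cost model: a one-time precomputation cost plus a per-epoch cost. The sketch below works through that arithmetic with the reported numbers (about 110 s precomputation and 1.2 s per epoch for DReX, about 130.5 s per epoch for D2S-R50). The function name and the epoch count of 100 are assumptions chosen for illustration, not values from the paper.

```python
def total_training_time(precompute_s, per_epoch_s, epochs):
    """Wall-clock estimate: one-time feature extraction plus per-epoch cost."""
    return precompute_s + per_epoch_s * epochs


# Timings reported in the caption (seconds).
DREX_PRECOMPUTE_S = 110.0
DREX_EPOCH_S = 1.2
D2S_R50_EPOCH_S = 130.5

epochs = 100  # hypothetical training length
drex_total = total_training_time(DREX_PRECOMPUTE_S, DREX_EPOCH_S, epochs)
d2s_total = total_training_time(0.0, D2S_R50_EPOCH_S, epochs)

# Number of epochs after which the cached-feature approach is already ahead.
break_even_epochs = DREX_PRECOMPUTE_S / (D2S_R50_EPOCH_S - DREX_EPOCH_S)

print(f"DReX: {drex_total:.1f} s, D2S-R50: {d2s_total:.1f} s")
print(f"speedup over {epochs} epochs: {d2s_total / drex_total:.1f}x")
print(f"break-even after {break_even_epochs:.2f} epochs")
```

Because the break-even point falls below one epoch, caching the frozen features pays for itself during the first epoch under these reported timings, and the advantage grows with every additional epoch.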