DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction

Jonathan Skaza; Parsa Madinei; Ziqi Wen; Miguel Eckstein

TL;DR

This paper tackles the problem of predicting human-perceived image complexity using a vision-only approach. It introduces DReX, a compact model that fuses self-supervised DINOv3 CLS tokens with multi-scale ResNet-50 features through a lightweight attention mechanism, keeping both backbones frozen and training only a small regression head. On the IC9600 benchmark, DReX achieves a Pearson correlation of $r=0.9581$ and a Spearman correlation of $\rho=0.9542$, generalizes well to SAVOIAS and PASCAL VOC, and uses far fewer trainable parameters than multimodal competitors. The results suggest that high-quality visual representations alone can align with human complexity judgments, offering a parameter-efficient template for other subjective perceptual tasks.
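The fusion idea summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `AttentionFusionSketch`, the hidden size, the tanh projection, the single learned scoring vector, and the sigmoid output are all assumptions made for illustration. The only details taken as given are that features come from frozen backbones and that a small head regresses a complexity score. The feature widths used below (384 for a ViT-S CLS token; 256, 512, 1024, and 2048 for pooled ResNet-50 stage outputs) are the standard widths of those architectures, but which stages DReX actually uses is not stated here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class AttentionFusionSketch:
    """Hypothetical attention fusion over precomputed (frozen) feature vectors."""

    def __init__(self, feature_dims, hidden_dim=128, seed=0):
        rng = np.random.default_rng(seed)
        # One projection per feature source, mapping it to a shared width.
        self.proj = [
            rng.normal(0.0, d ** -0.5, size=(d, hidden_dim)) for d in feature_dims
        ]
        # Scoring vector that assigns one attention logit per source (assumed form).
        self.score = rng.normal(0.0, hidden_dim ** -0.5, size=(hidden_dim,))
        # Small linear regression head (assumed form).
        self.head_w = rng.normal(0.0, hidden_dim ** -0.5, size=(hidden_dim,))
        self.head_b = 0.0

    def __call__(self, features):
        # features: list of arrays, each of shape (batch, d_i)
        tokens = np.stack(
            [np.tanh(f @ w) for f, w in zip(features, self.proj)], axis=1
        )  # (batch, n_sources, hidden_dim)
        weights = softmax(tokens @ self.score, axis=1)  # (batch, n_sources)
        fused = (weights[..., None] * tokens).sum(axis=1)  # (batch, hidden_dim)
        logits = fused @ self.head_w + self.head_b
        # Sigmoid keeps the prediction in (0, 1); the score range is an assumption.
        scores = 1.0 / (1.0 + np.exp(-logits))
        return scores, weights


# Random stand-ins for backbone outputs: one CLS vector plus four pooled CNN stages.
dims = [384, 256, 512, 1024, 2048]
rng = np.random.default_rng(1)
features = [rng.normal(size=(4, d)) for d in dims]

model = AttentionFusionSketch(dims)
scores, weights = model(features)
print(scores.shape, weights.shape)
```

In this sketch the per-source attention weights sum to one for each image, so they can be read as a rough indication of how much each feature source contributes to the fused representation. In a real training setup only the projections, the scoring vector, and the head would receive gradients.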

Abstract

Visual complexity prediction is a fundamental problem in computer vision with applications in image compression, retrieval, and classification. Understanding what makes humans perceive an image as complex is also a long-standing question in cognitive science. Recent approaches have leveraged multimodal models that combine visual and linguistic representations, but it remains unclear whether language information is necessary for this task. We propose DReX (DINO-ResNet Fusion), a vision-only model that fuses self-supervised and convolutional representations through a learnable attention mechanism to predict image complexity. Our architecture integrates multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, enabling the model to capture both low-level texture patterns and high-level semantic structure. DReX achieves state-of-the-art performance on the IC9600 benchmark (Pearson r = 0.9581), surpassing previous methods, including those trained on multimodal image-text data, while using approximately 21.5x fewer learnable parameters. Furthermore, DReX generalizes robustly across multiple datasets and metrics, achieving superior results on Pearson and Spearman correlation, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Ablation and attention analyses confirm that DReX leverages complementary cues from both backbones, with the DINOv3 [CLS] token enhancing sensitivity to visual complexity. Our findings suggest that visual features alone can be sufficient for human-aligned complexity prediction and that, when properly fused, self-supervised transformers and supervised deep convolutional neural networks offer complementary and synergistic benefits for this task.
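The four evaluation metrics named in the abstract can be computed as in the following sketch, which uses only NumPy. The function names and the toy score vectors are illustrative assumptions and do not come from the paper or its evaluation code; Spearman correlation is computed here as Pearson correlation on average ranks, which is its standard definition.

```python
import numpy as np


def pearson(a, b):
    """Pearson linear correlation between two 1-D arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))


def average_ranks(x):
    """1-based ranks, with tied values receiving the mean of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        mask = x == value
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(diff)))


# Toy human ratings and model predictions (made-up values for illustration).
y_true = [0.10, 0.40, 0.35, 0.80]
y_pred = [0.20, 0.50, 0.30, 0.70]

print("Pearson r :", pearson(y_true, y_pred))
print("Spearman  :", spearman(y_true, y_pred))
print("RMSE      :", rmse(y_true, y_pred))
print("MAE       :", mae(y_true, y_pred))
```

Pearson and Spearman measure how well predictions track human ratings linearly and in rank order, respectively, while RMSE and MAE measure absolute error, so a model can score well on correlation and still be poorly calibrated in absolute terms.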
