Foundation Models Boost Low-Level Perceptual Similarity Metrics
Abhijay Ghildyal, Nabajeet Barman, Saman Zadtootaghaj
TL;DR
The paper addresses low-level perceptual similarity for full-reference image quality assessment (FR-IQA) without any training, by using intermediate features from foundation models. It systematically compares intermediate features against final embeddings from CLIP and DINO backbones on LIVE, TID2013, and PIPAL, using multiple distance measures: $\ell_2$, cosine, and distribution-based measures such as SKLD, JSD, and WSD. The key finding is that intermediate features yield more accurate and robust similarity scores, with DINOv1-ViT-B emerging as the strongest backbone, particularly on the diverse PIPAL dataset. This training-free approach rivals traditional and learned FR-IQA methods and suggests a practical path for perceptual quality assessment in real-world applications. Future work will explore fine-tuning these features for further gains.
Abstract
For full-reference image quality assessment (FR-IQA) using deep-learning approaches, the perceptual similarity score between a distorted image and a reference image is typically computed as a distance measure between features extracted from a pretrained CNN or, more recently, a Transformer network. Often, these intermediate features require further fine-tuning or processing with additional neural network layers to align the final similarity scores with human judgments. So far, most IQA models based on foundation models have relied primarily on the final layer or the embedding for quality score estimation. In contrast, this work explores the potential of the intermediate features of these foundation models, which have so far been largely unexplored in the design of low-level perceptual similarity metrics. We demonstrate that the intermediate features are more effective than the final-layer embeddings. Moreover, without requiring any training, metrics built from distance measures between these features can outperform both traditional and state-of-the-art learned metrics.
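As a rough illustration of the training-free idea described above, the following Python sketch computes a score directly from per-layer features using a distance measure. It is not the authors' implementation: the function names, the softmax used to turn features into distributions, the averaging over layers, and the random arrays standing in for intermediate ViT features are all assumptions made for this example.

```python
import numpy as np


def l2_distance(f_ref, f_dist):
    """Euclidean (l2) distance between two flattened feature maps."""
    return float(np.linalg.norm(f_ref.ravel() - f_dist.ravel()))


def cosine_distance(f_ref, f_dist, eps=1e-12):
    """One minus the cosine similarity of the flattened feature maps."""
    a, b = f_ref.ravel(), f_dist.ravel()
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return float(1.0 - cos)


def _softmax(x):
    # Assumption: features are turned into a probability vector with a
    # softmax; the source does not say how distributions are formed.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def js_divergence(f_ref, f_dist, eps=1e-12):
    """Jensen-Shannon divergence between softmax-normalized features."""
    p, q = _softmax(f_ref.ravel()), _softmax(f_dist.ravel())
    m = 0.5 * (p + q)

    def kl(a, b):
        return np.sum(a * np.log((a + eps) / (b + eps)))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))


def perceptual_distance(ref_feats, dist_feats, distance=l2_distance):
    """Average a distance over matching layers (hypothetical aggregation)."""
    per_layer = [distance(r, d) for r, d in zip(ref_feats, dist_feats)]
    return sum(per_layer) / len(per_layer)


# Random arrays stand in for intermediate features of three layers
# (197 tokens x 768 channels is a typical ViT-B/16 shape, assumed here).
rng = np.random.default_rng(0)
ref_feats = [rng.standard_normal((197, 768)) for _ in range(3)]
noise = [rng.standard_normal(f.shape) for f in ref_feats]
mild = [f + 0.05 * n for f, n in zip(ref_feats, noise)]
strong = [f + 0.5 * n for f, n in zip(ref_feats, noise)]

for name, fn in [("l2", l2_distance),
                 ("cosine", cosine_distance),
                 ("JSD", js_divergence)]:
    print(name,
          perceptual_distance(ref_feats, mild, fn),
          perceptual_distance(ref_feats, strong, fn))
```

In practice the random arrays would be replaced by features taken from the same intermediate layers of a backbone for the reference and the distorted image. A larger distance is then read as lower perceptual similarity.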
