Don't Judge Before You CLIP: A Unified Approach for Perceptual Tasks

Amit Zalcher, Navve Wasserman, Roman Beliy, Oliver Heinimann, Michal Irani

TL;DR

PerceptCLIP addresses the challenge of data-scarce perceptual tasks by leveraging CLIP as a rich perceptual prior and applying lightweight LoRA-based tuning to the vision encoder, avoiding task-specific architectural changes. The approach achieves state-of-the-art performance across image memorability, no-reference image quality assessment, and visual emotion analysis, demonstrating strong cross-dataset generalization. A two-stage multi-dataset training strategy further boosts performance on small datasets by sharing CLIP representations while maintaining dataset-specific heads. The work highlights the latent perceptual knowledge encoded in CLIP, enabling efficient, unified modeling of subjective visual judgments with practical impact for multimedia analysis and marketing applications.

Abstract

Visual perceptual tasks aim to predict human judgment of images (e.g., emotions evoked by images, image quality assessment). Unlike objective tasks such as object/scene recognition, perceptual tasks rely on subjective human assessments, making their data labeling difficult. The scarcity of such human-annotated data results in small datasets, leading to poor generalization. Typically, specialized models were designed for each perceptual task, tailored to its unique characteristics and its own training dataset. We propose a unified architectural framework for solving multiple different perceptual tasks leveraging CLIP as a prior. Our approach is based on recent cognitive findings which indicate that CLIP correlates well with human judgment. While CLIP was explicitly trained to align images and text, it implicitly also learned human inclinations. We attribute this to the inclusion of human-written image captions in CLIP's training data, which contain not only factual image descriptions, but inevitably also human sentiments and emotions. This makes CLIP a particularly strong prior for perceptual tasks. Accordingly, we suggest that minimal adaptation of CLIP suffices for solving a variety of perceptual tasks. Our simple unified framework employs a lightweight adaptation to fine-tune CLIP to each task, without requiring any task-specific architectural changes. We evaluate our approach on three tasks: (i) Image Memorability Prediction, (ii) No-reference Image Quality Assessment, and (iii) Visual Emotion Analysis. Our model achieves state-of-the-art results on all three tasks, while demonstrating improved generalization across different datasets.
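
The lightweight adaptation described above (LoRA applied to the vision encoder, followed by a small prediction head) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class `LoRALinear`, the function `mlp_head`, the rank, the scaling factor `alpha`, and all dimensions are hypothetical choices made for illustration, and a single frozen linear layer stands in for one projection inside the CLIP vision encoder.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (LoRA-style).

    Hypothetical illustration: stands in for one projection of a vision encoder.
    """

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight  # frozen pretrained weight, shape (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, rank))  # trainable up-projection, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T            # output of the frozen layer
        update = (x @ self.A.T) @ self.B.T  # low-rank correction
        return base + self.scale * update

    def num_trainable(self):
        # Only A and B would be updated during fine-tuning.
        return self.A.size + self.B.size


def mlp_head(features, w1, b1, w2, b2):
    """Two-layer MLP with ReLU that maps features to one score per image."""
    hidden = np.maximum(features @ w1 + b1, 0.0)
    return hidden @ w2 + b2


# Toy dimensions (assumed, far smaller than a real encoder).
rng = np.random.default_rng(0)
d_in, d_out, hidden_dim = 16, 16, 8
frozen_w = rng.normal(size=(d_out, d_in))
layer = LoRALinear(frozen_w, rank=4)

x = rng.normal(size=(2, d_in))  # two fake feature vectors
features = layer(x)

w1, b1 = rng.normal(size=(d_out, hidden_dim)), np.zeros(hidden_dim)
w2, b2 = rng.normal(size=(hidden_dim, 1)), np.zeros(1)
scores = mlp_head(features, w1, b1, w2, b2)  # shape (2, 1)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, so fine-tuning begins from the pretrained behavior. In this toy setup the low-rank factors hold 128 trainable values against 256 in the frozen weight; the saving grows with layer width, since the factors scale with `rank * (d_in + d_out)` rather than `d_in * d_out`.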

Paper Structure

This paper contains 35 sections, 1 equation, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Our Framework: (a) Perceptual tasks rely on subjective human judgment. (b) Illustration of CLIP’s training samples, which include human-written captions. These human-generated annotations contain not only factual image descriptions, but inevitably also human sentiments, preferences and emotions. This suggests that CLIP can serve as a prior for perceptual tasks. (c) Our approach leverages CLIP’s prior knowledge to address multiple perceptual tasks with minimal task-specific adaptation. (d) We achieve state-of-the-art performance across three distinct perceptual tasks. CB refers to the current best method in each task (see Tables \ref{tab:IQA_single}, \ref{tab:MEM_single}, \ref{tab:EMO_single} for numerical scores).
  • Figure 2: Unified Framework for Perceptual Tasks: Using the CLIP vision encoder followed by an MLP head, our architecture maintains a simple, shared structure across diverse perceptual tasks. With lightweight LoRA adaptation, it fine-tunes efficiently for each task independently, effectively exploiting CLIP’s prior perceptual knowledge.
  • Figure 3: Visual Perceptual Tasks.
  • Figure 4: Attention Shift Toward Perceptual Cues. We present images along with the differences in their attention maps between our PerceptCLIP model and the pretrained CLIP vision encoder (displaying results from critical attention heads that most influence the perceptual predictions). This highlights the shift in attention, revealing how our model reallocates focus to perceptually meaningful regions.
  • Figure S1: Attention Mask Visualization - Positive Emotions. We present images alongside their corresponding attention maps from the pretrained CLIP vision encoder and our PerceptCLIP model (displaying results from critical attention heads that most influence the perceptual predictions). This highlights the shift in attention, revealing how our model reallocates focus to perceptually meaningful regions.
  • ...and 1 more figures