
CLPIPS: A Personalized Metric for AI-Generated Image Similarity

Khoi Trinh, Jay Rothenberger, Scott Seidenberger, Dimitrios Diochnos, Anindya Maiti

Abstract

Iterative prompt refinement is central to reproducing target images with text-to-image generative models. Previous studies have incorporated image similarity metrics (ISMs) as additional feedback to human users. Existing ISMs such as LPIPS and CLIP provide objective measures of image likeness but often fail to align with human judgments, particularly in context-specific or user-driven tasks. In this paper, we introduce Customized Learned Perceptual Image Patch Similarity (CLPIPS), a customized extension of LPIPS that adapts a metric's notion of similarity directly to human judgments. We aim to explore whether lightweight, human-augmented fine-tuning can meaningfully improve perceptual alignment, positioning similarity metrics as adaptive components for human-in-the-loop workflows with text-to-image tools. We evaluate CLPIPS on a human-subject dataset in which participants iteratively regenerate target images and rank generated outputs by perceived similarity. Using a margin ranking loss on human-ranked image pairs, we fine-tune only the LPIPS layer combination weights and assess alignment via Spearman rank correlation and the Intraclass Correlation Coefficient (ICC). Our results show that CLPIPS achieves stronger correlation and agreement with human judgments than baseline LPIPS. Rather than optimizing absolute metric performance, our work emphasizes improving alignment consistency between metric predictions and human ranks, demonstrating that even limited human-specific fine-tuning can meaningfully enhance perceptual alignment in human-in-the-loop text-to-image workflows.
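The fine-tuning recipe in the abstract can be sketched in a few lines. The following is a minimal, hedged illustration (not the authors' code): an LPIPS-style score is a weighted sum of per-layer feature distances, and a margin ranking loss on human-ranked pairs is minimized over only those combination weights. All names and values here (`margin`, `lr`, the synthetic layer distances) are illustrative assumptions; in the real pipeline the per-layer distances would come from a pretrained LPIPS backbone.

```python
# Illustrative sketch, not the authors' implementation: fine-tuning
# LPIPS-style layer combination weights with a margin ranking loss.
# Per-layer feature distances are assumed precomputed by a frozen backbone.

def clpips_score(weights, layer_dists):
    """LPIPS-style distance: weighted sum of per-layer feature distances."""
    return sum(w * d for w, d in zip(weights, layer_dists))

def margin_ranking_loss(weights, pair, margin=0.1):
    """pair = (d_pos, d_neg): d_pos are the layer distances of the image the
    human ranked MORE similar, so its score should be smaller by >= margin."""
    d_pos, d_neg = pair
    return max(0.0, margin + clpips_score(weights, d_pos)
                           - clpips_score(weights, d_neg))

def finetune(weights, pairs, lr=0.05, margin=0.1, epochs=50):
    """Subgradient descent on the margin ranking loss over the combination
    weights only; weights are clamped non-negative, as in LPIPS."""
    w = list(weights)
    for _ in range(epochs):
        for d_pos, d_neg in pairs:
            if margin_ranking_loss(w, (d_pos, d_neg), margin) > 0.0:
                # Subgradient w.r.t. weight l is d_pos[l] - d_neg[l].
                w = [max(0.0, wl - lr * (p - n))
                     for wl, p, n in zip(w, d_pos, d_neg)]
    return w
```

As a toy usage: with uniform weights, a pair whose human-preferred image has a larger layer-0 distance but smaller layer-1 distance is mis-ranked; after `finetune` shifts weight toward layer 1, the metric's ordering matches the human's.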

Paper Structure

This paper contains 26 sections, 4 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Workflow of CLPIPS fine-tuning. Human-generated and ranked images are converted into pairwise tuples to fine-tune LPIPS using a margin ranking loss, producing CLPIPS, which is then evaluated against human similarity rankings.
  • Figure 2: An example of the two main survey tasks.
  • Figure 3: Comparison of human similarity rankings with LPIPS and CLPIPS rankings for the same set of generated images relative to a target image. Images are indexed according to their human-assigned rank (shown in the left column). The same index numbers are shown next to the corresponding images in the LPIPS and CLPIPS columns, indicating where each human-ranked image appears in the metric-based orderings. CLPIPS places images closer to their human-ranked positions than LPIPS, exhibiting fewer rank inversions. In this example, CLPIPS achieved an ICC of 0.45 while LPIPS achieved an ICC of -1.41.

Theorems & Definitions (2)

  • Example 1
  • Proof: Justification Sketch for Example 1