RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Dani Lischinski, Idan Szpektor

TL;DR

RefVNLI introduces a scalable, dual-output auto-rater for subject-driven T2I generation that jointly evaluates textual alignment and subject preservation. It trains a 3B Vision-Language Model (PaliGemma) on a large, auto-generated dataset of <image_ref, prompt, image_tgt> triplets, producing two binary scores in a single pass. Across the DreamBench++, ImagenHub, KITTEN, and ImageRAG benchmarks, RefVNLI matches or surpasses baselines, including GPT-4o-based metrics, with strong performance on rare subjects and robustness to identity-agnostic changes. This approach reduces API reliance and offers a scalable, reproducible tool to guide subject-driven image generation and evaluation.

Abstract

Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description while preserving the visual identity of a referenced subject image. Despite its broad downstream applicability, ranging from enhanced personalization in image generation to consistent character representation in video rendering, progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this gap, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single run. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or statistically matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 5.9-point gains in subject preservation.

Paper Structure

This paper contains 31 sections, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Illustration of RefVNLI: Given a reference image of a subject, a prompt referring to the subject, and a target image, RefVNLI assesses both subject preservation and textual alignment. For subject preservation, it distinguishes identity-preserving variations, like dew on a flower (top image), from identity-altering changes, such as color change (middle image). For textual alignment, it assesses whether the target image reflects all details from the prompt, such as the fence’s position relative to the flower (bottom image).
  • Figure 2: Qualitative Comparison: We compare RefVNLI with DreamBench++ and CLIP, which score both Subject Preservation (SP) and Textual Alignment (TA), using examples from the Animal, Object, and Human categories. DreamBench++ scores (0-4) are scaled to 0-100 for better readability. RefVNLI exhibits better robustness to identity-agnostic changes (SP), such as the zoomed-out parrot (top-middle) and the zoomed-out person with different attire (bottom-middle). It is also more sensitive to identity-defining traits, penalizing changed facial features (left-most person) and mismatched object patterns (left and middle balloons). Additionally, RefVNLI excels at detecting text-image mismatches (TA), as seen in its penalization of the top-left image for lacking a waterfall.
  • Figure 3: Generating subject preservation classification training instances from video frames. Given two pairs of frames, each extracted from distinct video scenes featuring the same entity (e.g., a dog), where both frames within each pair depict the same subject (e.g., the same dog), we construct training {image_ref, image_tgt} pairs for subject preservation classification. Positive pairs are formed by pairing a cropped subject from one frame (e.g., dog from left frame in Scene 1) with the full frame from the same scene (right frame in Scene 1). In contrast, negative pairs are created by pairing the cropped subject with the other scene's full frames (e.g., Scene 2). This process is applied to all four frames, with each taking turns as the cropped reference image (image_ref), while the corresponding full-frame counterparts serve as image_tgt, yielding a total of 4 positive and 8 negative training pairs.
  • Figure 4: Creating identity-sensitive {image_ref, image_tgt} pairs. Starting with an image and a mask of a subject (e.g., a bag), we randomly keep 5 patches within the masked area ([1]) and use them to create 5 inpainted versions ([2]). The version with the highest MSE between the altered and original areas (e.g., bottom image, MSE = 3983) is paired with the unmodified crop to form a negative pair, while the original image and the same crop create a positive pair, with the crop acting as image_ref in both cases.
  • Figure 5: Example of prompt-image_tgt pairs. Given an image with some subject (e.g., a dog), we create a positive prompt by adding a bounding box around the subject and directing Gemini to describe it (top prompts). Negative prompts are created by swapping prompts between images of the same entity (middle prompts). For additional hard negatives, we guide Gemini to modify a single non-subject detail in the positive prompt while keeping the rest unchanged (bottom prompts).
  • ...and 7 more figures
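
The pairing scheme in the Figure 3 caption can be checked with a small sketch. This is not the authors' code: the function name `build_preservation_pairs`, the frame identifiers, and the `crop(...)` string standing in for a cropped subject are all hypothetical. It only illustrates how two scenes of two frames each could yield the 4 positive and 8 negative pairs stated in the caption.

```python
def build_preservation_pairs(scenes):
    """Enumerate (image_ref, image_tgt) pairs as described in the Figure 3 caption.

    `scenes` maps a scene id to the frames taken from that scene; all frames in
    one scene show the same subject. Frame ids are placeholders (assumption).
    """
    positives, negatives = [], []
    for scene_id, frames in scenes.items():
        for ref in frames:  # each frame takes a turn as the cropped reference
            for other_id, other_frames in scenes.items():
                for tgt in other_frames:
                    if tgt == ref:
                        continue  # a frame is never paired with its own crop
                    pair = (f"crop({ref})", tgt)
                    if other_id == scene_id:
                        positives.append(pair)  # same scene: same subject
                    else:
                        negatives.append(pair)  # other scene: different subject
    return positives, negatives


scenes = {"scene1": ["s1_a", "s1_b"], "scene2": ["s2_a", "s2_b"]}
positives, negatives = build_preservation_pairs(scenes)
print(len(positives), len(negatives))  # 4 8
```

Each of the four frames contributes one positive (the other frame of its scene) and two negatives (both frames of the other scene), which is where the 4 and 8 come from.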
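
The selection step in the Figure 4 caption (keep the inpainted version with the highest MSE against the original) can be sketched as follows. The helper names `masked_mse` and `pick_hard_negative` are hypothetical, images are flattened to plain lists of pixel values for brevity, and restricting the MSE to the masked region is an assumption about what "altered and original areas" means; the caption does not spell this out.

```python
def masked_mse(original, altered, mask):
    """Mean squared error over the pixels where `mask` is truthy (assumed region)."""
    diffs = [(o - a) ** 2 for o, a, m in zip(original, altered, mask) if m]
    if not diffs:
        raise ValueError("mask selects no pixels")
    return sum(diffs) / len(diffs)


def pick_hard_negative(original, inpainted_versions, mask):
    """Return the inpainted version that differs most from the original."""
    return max(inpainted_versions, key=lambda v: masked_mse(original, v, mask))


# Toy data: 4 "pixels", with the subject mask covering the middle two.
original = [10, 20, 30, 40]
mask = [0, 1, 1, 0]
versions = [
    [10, 22, 31, 40],  # small change inside the mask
    [10, 80, 90, 40],  # large change inside the mask
    [99, 20, 30, 99],  # change only outside the mask
]
negative = pick_hard_negative(original, versions, mask)
print(negative)  # [10, 80, 90, 40]
```

In the caption's terms, the selected version would be paired with the unmodified crop as a negative, while the original image and the same crop form the positive.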