Selecting Fine-Tuning Examples by Quizzing VLMs

Tenghao Ji, Eytan Adar

TL;DR

The paper tackles the problem of data quality in fine-tuning diffusion models for topic-specific generation. It introduces QZLoRA, which combines QuizRank—treating images as an educational intervention for a Vision-Language Model—with LoRA to automatically select the most representative training images, enabling effective fine-tuning with fewer samples. Empirical results show that QuizRank-guided fine-tuning yields higher alignment and stability across topics in both photorealistic and illustration domains, with top-$k$ selections (especially $k=15$) performing best and exhibiting stronger input-output correlations. Additionally, QuizRank serves as a scalable evaluation metric for generated outputs, offering a principled way to compare approaches and quantify improvements in topic-adaptive generation.

Abstract

A challenge in fine-tuning text-to-image diffusion models for specific topics is selecting good examples. Fine-tuning on image sets of varying quality, such as Wikimedia Commons, will often produce poor output. However, training images that do exemplify the target concept (e.g., a female Mountain Bluebird) help ensure that the generated images are similarly representative (e.g., have the prototypical blue wings and gray chest). In this work, we propose QZLoRA, a framework to select images for low-rank adaptation (LoRA). The approach leverages QuizRank, a method to automatically rank images by treating them as an 'educational intervention' and 'quizzing' a VLM. We demonstrate that QZLoRA can produce better aligned, photorealistic images with fewer samples. We also show that these fine-tuned models can produce stylized images (i.e., illustrations) that are similarly representative. Our results highlight the promise of combining automated visual reasoning with parameter-efficient fine-tuning for topic-adaptive generative modeling.
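The selection idea described in the abstract can be illustrated with a minimal sketch. It assumes each candidate image has already been quizzed and that the outcome is available as a list of per-question correct/incorrect flags. The function names `quiz_score` and `select_top_k`, the file names, and the sample data are hypothetical and are not taken from the paper; the VLM call itself is deliberately left out.

```python
from typing import Dict, List


def quiz_score(answers: List[bool]) -> float:
    """Fraction of quiz questions the VLM answered correctly for one image."""
    if not answers:
        return 0.0
    return sum(answers) / len(answers)


def select_top_k(results: Dict[str, List[bool]], k: int) -> List[str]:
    """Rank images by quiz score (highest first) and keep the top k.

    Ties are broken by image name so the ordering is deterministic.
    """
    ranked = sorted(results, key=lambda name: (-quiz_score(results[name]), name))
    return ranked[:k]


# Hypothetical per-image results: True means the VLM answered that question correctly.
results = {
    "bluebird_01.jpg": [True, True, True, False],
    "bluebird_02.jpg": [True, False, False, False],
    "bluebird_03.jpg": [True, True, True, True],
}

# The selected images would then be passed on to LoRA fine-tuning.
top_images = select_top_k(results, k=2)
```

When every image takes the same test, ranking by the fraction of correct answers is equivalent to ranking by the raw count of correct answers.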

Paper Structure

This paper contains 17 sections, 7 figures.

Figures (7)

  • Figure 1: Example outputs for various topics (in rows). From left to right: a real example from Wikimedia Commons, a photorealistic image generated with no fine-tuning, an image generated with a LoRA tuned on a random sample of Commons images, an image from a LoRA tuned on the 2 best QuizRank-scored images, and an image from a LoRA tuned on the 15 best images. The QuizRank score is displayed below each image.
  • Figure 2: The QZLoRA approach: A test is generated based on visual properties of the target object by using textual descriptions of the target and distractors (step A). Each candidate image is fed into a VLM together with the test (step B). Images are ranked based on how many questions the VLM answers correctly (step C), and the top-k are fed to LoRA for fine-tuning (step D). A Stable Diffusion model is used to generate new images (step E). Optionally, these images can be ranked (step F) using the original QuizRank test (elements recreated from ji2025quizrankpickingimagesquizzing).
  • Figure 3: Example outputs for various topics (in rows). From left to right: a real example from Wikimedia Commons, an illustrated image generated with no fine-tuning, an image generated with a LoRA tuned on a random sample of Commons images, and an image from a LoRA tuned on the 15 best QuizRank-scored images. The QuizRank score is displayed below each image.
  • Figure 4: Boxplot of QuizRank accuracy under different conditions. LoRA fine-tuning guided by QuizRank-selected images shows higher median accuracy and reduced variance compared to random or baseline settings.
  • Figure 5: Pairwise net advantage among all conditions. Warm (red) colors denote that the row method performs better on more topics. LoRA-QuizRank (Top 15) achieves the most consistent advantage. The 'Wikipedia' rows/columns reflect the scores of real (non-generated) images for the topic from Wikimedia Commons.
  • ...and 2 more figures
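
The pairwise comparison summarized in the Figure 5 caption can be sketched as follows. This assumes one plausible definition of net advantage: the number of topics on which the row condition scores higher than the column condition, minus the number on which it scores lower. That definition, the function name `net_advantage`, the condition labels, and the scores are all assumptions made for illustration and are not results from the paper.

```python
from typing import Dict, List


def net_advantage(scores: Dict[str, List[float]]) -> Dict[str, Dict[str, int]]:
    """Build a matrix of (wins - losses) for every ordered pair of conditions.

    scores maps a condition name to its per-topic scores; all lists must
    list the topics in the same order.
    """
    matrix: Dict[str, Dict[str, int]] = {}
    for row, row_scores in scores.items():
        matrix[row] = {}
        for col, col_scores in scores.items():
            wins = sum(r > c for r, c in zip(row_scores, col_scores))
            losses = sum(r < c for r, c in zip(row_scores, col_scores))
            matrix[row][col] = wins - losses
    return matrix


# Hypothetical per-topic scores for three conditions over three topics.
scores = {
    "baseline": [0.5, 0.6, 0.4],
    "lora_random": [0.6, 0.6, 0.3],
    "lora_top15": [0.8, 0.7, 0.6],
}
matrix = net_advantage(scores)
```

Under this definition a positive entry means the row condition wins on more topics than it loses, the matrix is antisymmetric, and tied topics contribute nothing.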