You Only Submit One Image to Find the Most Suitable Generative Model

Zhi Zhou; Lan-Zhe Guo; Peng-Xiao Song; Yu-Feng Li

TL;DR

This work introduces Generative Model Identification (GMI), a framework that lets users efficiently identify the most suitable model among many diffusion-based generators using only a single example image. The approach combines three components: a weighted Reduced Kernel Mean Embedding (RKME) that captures each model's generated-image distribution and the relationship between images and prompts, a large pre-trained vision-language model that maps images and prompts into a shared feature space, and an image interrogator that bridges the cross-modality gap. A two-stage pipeline, with a submitting stage that builds model specifications and an identification stage that handles user requirements, produces similarity scores that rank the candidate models. On a benchmark of 16 diffusion models, the method achieves over 80% top-4 accuracy. The work offers a practical, scalable path toward precise and efficient model discovery on model hubs, with potential acceleration via vector databases and further improvements through richer specifications and prompts.
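The ranking step can be illustrated with a minimal sketch. It assumes that image features have already been extracted by some vision-language encoder, that each model specification is a small weighted set of feature points (a reduced set), and that a Gaussian kernel is used. The names `rbf_kernel`, `score_model`, `rank_models`, and the toy data are hypothetical, and the score shown here is the plain RKHS inner product between the query and the reduced set, not necessarily the exact scoring rule used in the paper.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point error.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def score_model(query_feat, spec_points, spec_weights, gamma=1.0):
    """Inner product between the query's kernel embedding and a weighted reduced set.

    With specification mu = sum_j w_j k(z_j, .), the score is sum_j w_j k(z_j, x).
    """
    k = rbf_kernel(query_feat[None, :], spec_points, gamma)  # shape (1, m)
    return float(k[0] @ spec_weights)

def rank_models(query_feat, specs, gamma=1.0):
    """Return model names sorted from highest to lowest similarity score."""
    scores = {
        name: score_model(query_feat, points, weights, gamma)
        for name, (points, weights) in specs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy specifications (hypothetical): two models with 2-D features.
toy_specs = {
    "model_a": (np.array([[0.0, 0.0], [0.2, 0.1]]), np.array([0.5, 0.5])),
    "model_b": (np.array([[5.0, 5.0], [5.1, 4.9]]), np.array([0.5, 0.5])),
}
toy_query = np.array([0.1, 0.0])  # stands in for the feature of one example image
toy_ranking = rank_models(toy_query, toy_specs)
```

Because each specification is only a handful of weighted points, scoring a query against a model costs one small kernel evaluation, which is what makes this kind of lookup compatible with the vector-database acceleration mentioned above.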

Abstract

Deep generative models have achieved promising results in image generation, and various generative model hubs, e.g., Hugging Face and Civitai, have been developed where model developers upload models and users download them. However, these hubs lack advanced model management and identification mechanisms, so users can only search for models through text matching, sorting by downloads, and similar means, which makes it difficult to efficiently find the model that best meets their requirements. In this paper, we propose a novel setting called Generative Model Identification (GMI), which aims to enable users to efficiently identify the most appropriate generative model(s) for their requirements from a large number of candidate models. To the best of our knowledge, this setting has not been studied before. We introduce a comprehensive solution consisting of three pivotal modules: a weighted Reduced Kernel Mean Embedding (RKME) framework for capturing the generated image distribution and the relationship between images and prompts, a pre-trained vision-language model aimed at addressing dimensionality challenges, and an image interrogator designed to tackle cross-modality issues. Extensive empirical results demonstrate that the proposal is both efficient and effective. For example, users only need to submit a single example image to describe their requirements, and the model platform achieves an average top-4 identification accuracy of more than 80%.
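For readers unfamiliar with the metric, top-k identification accuracy can be computed as in the sketch below. It assumes that each query yields a ranked list of candidate model names and that exactly one model is the ground truth for that query; the function name `top_k_accuracy` and the toy rankings are hypothetical and are not taken from the paper's evaluation code.

```python
def top_k_accuracy(rankings, true_models, k=4):
    """Fraction of queries whose ground-truth model appears in the top k of its ranking."""
    if len(rankings) != len(true_models):
        raise ValueError("rankings and true_models must have the same length")
    if not true_models:
        raise ValueError("at least one query is required")
    hits = sum(
        1 for ranked, truth in zip(rankings, true_models) if truth in ranked[:k]
    )
    return hits / len(true_models)

# Toy example (hypothetical): two queries over five candidate models.
toy_rankings = [
    ["a", "b", "c", "d", "e"],  # ground truth "d" is ranked 4th: counts as a hit for k=4
    ["e", "d", "c", "b", "a"],  # ground truth "a" is ranked 5th: a miss for k=4
]
toy_truth = ["d", "a"]
toy_top4 = top_k_accuracy(toy_rankings, toy_truth, k=4)
```

An average top-4 accuracy above 80% on a 16-model benchmark therefore means that, for more than four out of five query images, the correct model is among the four highest-ranked candidates.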
