OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation

Yexin Liu; Manyuan Zhang; Yueze Wang; Hongyu Li; Dian Zheng; Weiming Zhang; Changsheng Lu; Xunliang Cai; Yan Feng; Peng Pei; Harry Yang

TL;DR

OpenSubject tackles identity fidelity and context diversity in subject-driven image generation by building a large, video-derived corpus through a four-stage pipeline that leverages cross-frame identity priors. The dataset supports both single- and multi-subject conditioning, and the work introduces OSBench, a four-task benchmark scored by rubric-based VLM judgments. Empirically, fine-tuning on OpenSubject yields significant gains in identity fidelity and manipulation robustness on OSBench and on external benchmarks, especially in complex multi-subject scenes. The work also details practical implementation guidelines and ethical considerations for data licensing and use.

Abstract

Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
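To make stages (i) and (ii) concrete, the following is a minimal sketch of clip-level filtering followed by diversity-aware pairing of frames from the same clip. It is an illustration under stated assumptions, not the authors' implementation: the `Frame` fields, the thresholds `min_side` and `min_aesthetic`, and the use of low bounding-box overlap as a stand-in for "diversity" are all hypothetical. In the paper, category consensus and grounding are performed by a VLM, which is not modeled here.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


@dataclass(frozen=True)
class Frame:
    """Hypothetical record for one video frame with a grounded subject box."""
    clip_id: str
    index: int        # frame position within the clip
    width: int
    height: int
    aesthetic: float  # assumed aesthetic score in [0, 1]
    box: Box


def passes_curation(f: Frame, min_side: int = 512, min_aesthetic: float = 0.5) -> bool:
    """Stage (i) stand-in: resolution and aesthetic filtering (thresholds are assumptions)."""
    return min(f.width, f.height) >= min_side and f.aesthetic >= min_aesthetic


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pick_diverse_pair(frames: List[Frame]) -> Optional[Tuple[Frame, Frame]]:
    """Stage (ii) stand-in: among curated frames of the same clip, return the pair
    whose subject boxes overlap least (an assumed proxy for pose/context diversity)."""
    kept = [f for f in frames if passes_curation(f)]
    best, best_score = None, -1.0
    for a, b in combinations(kept, 2):
        if a.clip_id != b.clip_id:
            continue  # cross-frame identity prior: both frames come from one clip
        score = 1.0 - box_iou(a.box, b.box)
        if score > best_score:
            best, best_score = (a, b), score
    return best
```

Pairing within a single clip is what lets the same subject appear under different poses and backgrounds without any identity labeling; a real pipeline would replace the overlap heuristic with whatever diversity criterion the dataset actually uses.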
