OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation

Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng, Weiming Zhang, Changsheng Lu, Xunliang Cai, Yan Feng, Peng Pei, Harry Yang

TL;DR

OpenSubject tackles identity fidelity and context diversity in subject-driven image generation by building a large, video-derived corpus through a four-stage pipeline that leverages cross-frame identity priors. The dataset supports single- and multi-subject conditioning, and the work introduces OSBench, a four-task benchmark scored with rubric-based VLM judgments. Empirically, fine-tuning on OpenSubject improves identity fidelity and manipulation robustness on OSBench and external benchmarks, especially in complex multi-subject scenes. The work also details practical implementation guidelines and ethical considerations for data licensing and use.

Abstract

Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
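The verify-and-re-synthesize loop described in stage (iv) can be illustrated with a minimal sketch. This is not the authors' implementation: the `synthesize` and `verify` callables, the `Sample` record, and the `max_attempts` retry cap are all hypothetical names introduced here for illustration, and the abstract does not say how many retries are allowed or what happens to samples that keep failing.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    """A synthesized reference image that passed verification (hypothetical record)."""
    pair_id: str
    image: str      # stand-in for real image data
    attempts: int   # number of synthesis attempts needed to pass


def synthesize_with_verification(
    pair_id: str,
    synthesize: Callable[[str, int], str],  # assumed stage (iii) entry point
    verify: Callable[[str], bool],          # assumed stage (iv) VLM check
    max_attempts: int = 3,                  # assumed cap, not stated in the source
) -> Optional[Sample]:
    """Re-run synthesis until the verifier accepts the result or the cap is hit."""
    for attempt in range(1, max_attempts + 1):
        image = synthesize(pair_id, attempt)
        if verify(image):
            return Sample(pair_id=pair_id, image=image, attempts=attempt)
    # Every attempt was rejected; this sketch simply drops the sample.
    return None


# Stub callables standing in for the inpainting/outpainting model and the VLM.
def fake_synthesize(pair_id: str, attempt: int) -> str:
    return f"{pair_id}-v{attempt}"


def fake_verify(image: str) -> bool:
    return image.endswith("v2")  # pretend only the second attempt is artifact-free


result = synthesize_with_verification("clip42", fake_synthesize, fake_verify)
rejected = synthesize_with_verification("clip42", fake_synthesize, lambda image: False)
```

In a real pipeline the stubs would wrap the synthesis model and the VLM artifact check, and captioning would run only on samples that come back non-`None`.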

Paper Structure

This paper contains 33 sections, 28 figures, 7 tables, 3 algorithms.

Figures (28)

  • Figure 1: OpenSubject examples illustrating single- and multi-subject driven image generation and manipulation across human, object, and cartoon domains, spanning indoor and outdoor scenes and diverse viewpoints, and highlighting identity fidelity and contextual diversity.
  • Figure 2: Overview of the OpenSubject pipeline. (a) Video curation: collect videos from OpenHumanVid, OpenVid, and OpenS2V, and apply resolution and aesthetic filters. (b) Cross-frame subject mining and pairing: verify objects with a vision–language model (category consensus, visual clarity, occlusion, facial visibility), localize with Grounding-DINO, and select diverse frame pairs. (c) Identity-preserving reference synthesis: use mask-guided outpainting for generation and box-guided inpainting for manipulation. (d) Automated verification and captioning: perform VLM-based artifact checks and regenerate failures, then produce short and long captions for training.
  • Figure 3: Dataset statistics of OpenSubject. (a) Spatial resolution distributions. (b) Distribution of the number of references per sample. (c) Word cloud for subjects. (d) Task composition across four sub-tasks. (e) Subject category frequency.
  • Figure 4: Qualitative comparison. Colored dashed boxes mark regions of interest for comparison. Boxes of the same color denote corresponding regions across methods and refer to the related area in the input image.
  • Figure 5: Prompt for object extraction.
  • ...and 23 more figures