LibraGen: Playing a Balance Game in Subject-Driven Video Generation

Jiahao Zhu, Shanshan Lao, Lijie Liu, Gen Li, Tianhao Qi, Wei Han, Bingchuan Li, Fangfang Liu, Zhuowei Chen, Tianxiang Ma, Qian HE, Yi Zhou, Xiaohua Xie

Abstract

With the advancement of video generation foundation models (VGFMs), customized generation, particularly subject-to-video (S2V), has attracted growing attention. However, a key challenge lies in balancing the intrinsic priors of a VGFM, such as motion coherence, visual aesthetics, and prompt alignment, with its newly derived S2V capability. Existing methods often neglect this balance by enhancing one aspect at the expense of others. To address this, we propose LibraGen, a novel framework that views extending foundation models for S2V generation as a balance game between intrinsic VGFM strengths and S2V capability. Specifically, guided by the core philosophy of "Raising the Fulcrum, Tuning to Balance," we identify data quality as the fulcrum and advocate a quality-over-quantity approach. We construct a hybrid pipeline that combines automated and manual data filtering to improve overall data quality. To further harmonize the VGFM's native capabilities with its S2V extension, we introduce a Tune-to-Balance post-training paradigm. During supervised fine-tuning, both cross-pair and in-pair data are incorporated, and model merging is employed to achieve an effective trade-off. Subsequently, two tailored direct preference optimization (DPO) pipelines, namely Consis-DPO and Real-Fake DPO, are designed and merged to consolidate this balance. During inference, we introduce a time-dependent dynamic classifier-free guidance scheme to enable flexible and fine-grained control. Experimental results demonstrate that LibraGen outperforms both open-source and commercial S2V models using only thousand-scale training data.
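
The time-dependent classifier-free guidance mentioned above can be illustrated with a minimal sketch. The abstract does not state the actual schedule, so the linear interpolation, the endpoint values `w_start` and `w_end`, and the function names `guidance_scale` and `cfg_combine` are all assumptions made for illustration. Only the combination rule `uncond + w * (cond - uncond)` is the standard classifier-free guidance formula.

```python
from typing import List


def guidance_scale(t: float, w_start: float = 7.5, w_end: float = 3.0) -> float:
    """Hypothetical linear schedule for the guidance weight.

    t = 1.0 is taken to be the start of sampling (pure noise) and
    t = 0.0 the final step. Both the endpoints and the linear shape
    are assumptions, not values taken from the paper.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return w_end + (w_start - w_end) * t


def cfg_combine(uncond: List[float], cond: List[float], t: float) -> List[float]:
    """Standard classifier-free guidance, but with a weight that depends on t."""
    w = guidance_scale(t)
    return [u + w * (c - u) for u, c in zip(uncond, cond)]
```

With this toy schedule, guidance is stronger early in sampling and weaker near the end. A real schedule could just as well increase over time or differ per condition (text versus reference subject).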

Paper Structure

This paper contains 17 sections, 5 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: A balance game in S2V generation. (a) T2V/I2V foundation models lack task-specific training data and thus exhibit poor S2V performance. (b) Previous S2V methods trained solely on in-pair data or (c) solely on cross-pair data often overlook the inherent balance trade-off. (d) LibraGen frames S2V generation as a balance game, achieving superior and well-balanced S2V performance.
  • Figure 2: We present LibraGen, a novel training paradigm that extends VGFMs to support both single-subject and multi-subject driven video generation.
  • Figure 3: Training and inference pipelines of LibraGen. During training, reference-related embeddings are concatenated with video frame embeddings along the temporal dimension, accompanied by dedicated flags. During inference, the user prompt is processed by a multi-stage prompt rephraser to better align with the training captions. The base MM-DiT produces 480P outputs, which can be further refined to 720P.
  • Figure 4: Human-aligned data curation pipeline. The automatic–manual hybrid data curation pipeline follows a quality-over-quantity strategy and consists of four stages: video collection, reference subject extraction, data captioning, and data labeling with dynamic updates.
  • Figure 5: Tune-to-Balance post-training paradigm.
  • ...and 3 more figures
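
The Figure 3 caption describes concatenating reference-related embeddings with video frame embeddings along the temporal dimension, accompanied by dedicated flags. A minimal sketch of that idea follows, using plain Python lists in place of tensors. The ordering (references appended after the video frames), the flag values (0 for video, 1 for reference), and the name `concat_with_flags` are assumptions for illustration; the caption does not specify them.

```python
from typing import List, Tuple

Frame = List[float]  # toy stand-in for one frame's embedding


def concat_with_flags(
    video: List[Frame], refs: List[Frame]
) -> Tuple[List[Frame], List[int]]:
    """Join video and reference embeddings along the time axis.

    Returns the joined sequence plus one flag per time step:
    0 marks a video frame, 1 marks a reference embedding
    (hypothetical encoding).
    """
    if video and refs and len(video[0]) != len(refs[0]):
        raise ValueError("video and reference embeddings must share a dimension")
    sequence = video + refs  # concatenation along the temporal (first) axis
    flags = [0] * len(video) + [1] * len(refs)
    return sequence, flags
```

With tensors, the same operation would typically be a single concatenation along the time axis, with the flags carried as an extra embedding or mask.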