SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

Fengming Liu, Tat-Jen Cham, Chuanxia Zheng

TL;DR

This paper presents SPATIALALIGN, a self-improvement framework that enhances T2V models' ability to depict Dynamic Spatial Relationships (DSR) specified in text prompts, and designs DSR-SCORE, a geometry-based metric that quantitatively measures how well generated videos align with the DSRs specified in the prompts.

Abstract

Most text-to-video (T2V) generators prioritize aesthetic quality but often ignore spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models' ability to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We present a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the DSRs specified in prompts, a step forward from prior works that rely on VLMs for evaluation. We also construct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly outperforms the baseline in depicting spatial relationships. The code will be released at Link.
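
For context, the generic DPO objective that a preference-based fine-tuning scheme of this kind starts from can be written as below. This is the standard formulation from the original DPO work, not the paper's zeroth-order regularized loss; the abstract does not give the regularizer or say how the likelihood ratios are instantiated for a video generator. The symbols are assumptions introduced here for illustration: $c$ is the text prompt, $v^w$ and $v^l$ are the preferred (winner) and dispreferred (loser) videos, $\pi_\theta$ is the model being fine-tuned, $\pi_\text{ref}$ is the frozen pre-trained model, $\sigma$ is the logistic function, and $\beta > 0$ controls how far $\pi_\theta$ may drift from $\pi_\text{ref}$.

$$
\mathcal{L}_\text{DPO}(\theta) = -\,\mathbb{E}_{(c,\, v^w,\, v^l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(v^w \mid c)}{\pi_\text{ref}(v^w \mid c)} - \beta \log \frac{\pi_\theta(v^l \mid c)}{\pi_\text{ref}(v^l \mid c)}\right)\right]
$$

Minimizing this loss raises the likelihood of winner videos relative to the reference model while lowering that of loser videos.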

Paper Structure (42 sections, 19 equations, 14 figures, 4 tables)

Figures (14)

  • Figure 1: Examples of spatial relationship reasoning. Given a prompt specifying the spatial relationships (top), our fine-tuned model (right) generates videos that correctly reflect the desired change in spatial relationships, while the baseline model (left) fails to do so.
  • Figure 2: Overview of SpatialAlign. (A) Given a text prompt, the pre-trained T2V model generates several video samples. For each sample, we first use GroundedSAM to obtain the bboxes of the animal and the object in each frame. (B) Then, for each frame, we compute the Static Spatial Relationship (SSR) Score based on the bboxes. From the SSR Score Sequence (all frames), we derive four metric components that are aggregated into the DSR-Score, which quantifies how well the video aligns with the dynamic spatial relationship (DSR). (C) During DPO training, we identify winner/loser pairs based on the DSR-Score using a threshold. We then train a LoRA to enhance the model's ability to accurately represent DSR in generated videos, using our proposed zeroth-order regularized DPO.
  • Figure 3: Ablation on loss. With only the DPO loss $\mathcal{L}_\text{DPO}$, the model degrades after 800 steps. With SFT loss $\mathcal{L}_\text{SFT}$, the color saturation is strong. With our zeroth-order regularization $\mathcal{L}_\text{ZO}$, the training is stable.
  • Figure 4: VLM vs. DSR-Score. Each plot contains one bin per SSR Score value. Each bin shows the proportion of cases in which the VLM answers YES or NO at that SSR Score value. The VLM answers YES in a significant proportion of cases in the low SSR Score interval, especially when evaluating the final SSR.
  • Figure 5: Qualitative comparisons with the state-of-the-art T2V models on DSR-Dataset. Our fine-tuned model (right columns) generates videos that correctly reflect the desired change in spatial relationships, while the baseline models (left columns) fail to do so.
  • ...and 9 more figures
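
The pipeline in the Figure 2 caption (per-frame bboxes, a per-frame SSR Score, a video-level score, then threshold-based winner/loser pairs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_ssr_left_of`, `toy_dsr_left_to_right`, and `select_pairs` are hypothetical names, the scoring formulas are simple stand-ins for the actual SSR Score and the four DSR-Score components (which are not given in this section), and the bounding boxes are hard-coded rather than produced by GroundedSAM.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def cx(self) -> float:
        """Horizontal center of the box."""
        return (self.x0 + self.x1) / 2.0


def toy_ssr_left_of(animal: Box, obj: Box, frame_width: float) -> float:
    """Hypothetical per-frame score in [0, 1] for "animal is left of object".

    ASSUMPTION: a linear function of the horizontal center offset,
    not the paper's SSR Score definition.
    """
    offset = (obj.cx - animal.cx) / frame_width
    return min(1.0, max(0.0, 0.5 + 0.5 * offset))


def toy_dsr_left_to_right(
    animal_boxes: Sequence[Box], obj_boxes: Sequence[Box], frame_width: float
) -> float:
    """Hypothetical video-level score for "moves from left to right of object".

    ASSUMPTION: averages only two terms (initial and final relationship);
    the paper aggregates four components that are not specified here.
    """
    ssr = [toy_ssr_left_of(a, o, frame_width) for a, o in zip(animal_boxes, obj_boxes)]
    if not ssr:
        raise ValueError("need at least one frame")
    start_ok = ssr[0]        # should start on the left
    end_ok = 1.0 - ssr[-1]   # should end on the right
    return (start_ok + end_ok) / 2.0


def select_pairs(scores: Dict[str, float], threshold: float) -> List[Tuple[str, str]]:
    """Pair every sample scoring >= threshold (winner) with every sample below it (loser).

    ASSUMPTION: the caption only says a threshold is used; this exact rule is a guess.
    """
    winners = [name for name, s in scores.items() if s >= threshold]
    losers = [name for name, s in scores.items() if s < threshold]
    return list(product(winners, losers))


# Toy data: a fixed object and two 3-frame "videos" in a 100-pixel-wide frame.
obj = Box(45, 40, 55, 60)
moving = [Box(10, 40, 20, 60), Box(45, 40, 55, 60), Box(80, 40, 90, 60)]
static = [Box(10, 40, 20, 60)] * 3

scores = {
    "moving": toy_dsr_left_to_right(moving, [obj] * 3, frame_width=100.0),
    "static": toy_dsr_left_to_right(static, [obj] * 3, frame_width=100.0),
}
pairs = select_pairs(scores, threshold=0.6)
print(scores, pairs)
```

In this toy setup the sample whose animal crosses from left to right scores higher than the one that never moves, so it becomes the winner in the single preference pair. In a real pipeline such pairs would feed the DPO training step.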