HECTOR: Hybrid Editable Compositional Object References for Video Generation

Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang, Alan Yuille, Chongyang Ma

TL;DR

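HECTOR is a generative pipeline for compositional video generation that conditions on static image and/or dynamic video references and lets users specify the trajectory of each referenced element. It synthesizes coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references, and the authors report superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.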

Abstract

Real-world videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compositional control. In contrast to prior methods, HECTOR supports hybrid reference conditioning, allowing generation to be simultaneously guided by static images and/or dynamic videos. Moreover, users can explicitly specify the trajectory of each referenced element, precisely controlling its location, scale, and speed (see Figure 1). This design allows the model to synthesize coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references. Extensive experiments demonstrate that HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.

Paper Structure (40 sections, 7 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: We propose HECTOR, a compositional, reference-guided video generation architecture. HECTOR supports conditioning on heterogeneous reference inputs (static images and/or dynamic videos) while enabling precise control over each referenced element’s location, scale, and speed. HECTOR also accommodates diverse operations, including multi-object composition, camera-motion control (e.g., zoom-in/zoom-out), and reference-driven video editing such as object insertion and replacement, as shown above.
  • Figure 2: Pipeline of the Video Decompositor, which extracts video composition alongside dynamic and static references from a video. Specifically, Video SAM is first used to segment elements from the footage. Depending on the entity size, we place one or multiple anchor points on each object. A point tracking method is then used to propagate these selected anchors over time. We design a reference trajectory extraction method that converts the anchor tracks into a composition layout, capturing both the scale and translation of the entity. Finally, we crop each object from the original video using the computed spatial parameters to serve as the reference.
  • Figure 3: Overview of the HECTOR framework, which accepts hybrid inputs—static images and dynamic video references—alongside user-defined spatiotemporal layouts. The Spatio-Temporal Alignment Module (STAM) projects these references into the latent space using dynamic Gaussian masks to create aligned feature conditions. These conditions guide the DiT backbone to synthesize a unified video that preserves reference fidelity while strictly adhering to the specified motion trajectories.
  • Figure 4: Qualitative comparison against baselines. We evaluate static reference-controlled video generation, as baselines are limited to this modality. The left column displays the source reference objects; for a fair experimental setup, we apply masks to crop the objects, ensuring all approaches receive only the object appearance without background context. The right columns show the resulting generated videos, illustrating the visual quality and precise spatial alignment with the input bounding box trajectories.
  • Figure 5: Qualitative results for video references. We demonstrate our framework's versatility with video-based references through three distinct applications: (a) Object Replacement, seamlessly transferring a reference object's identity onto a moving subject; (b) Compositional Multi-Subject Generation, where distinct video references independently control separate entities; and (c) Background-Locked Motion Editing, enabling precise foreground manipulations while keeping the background region frozen.
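
The captions for Figures 2 and 3 describe two steps that a small example may make more concrete: converting tracked anchor points into a per-frame layout (translation and scale), and turning that layout into dynamic Gaussian masks. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The function names (`layout_from_tracks`, `gaussian_mask`, `dynamic_masks`), the use of the anchor centroid as translation, the RMS anchor spread as a scale proxy, and the `base_sigma` parameter are all hypothetical choices; this excerpt does not give the paper's actual formulas.

```python
import math

def layout_from_tracks(tracks):
    """Convert anchor tracks into a per-frame layout (hypothetical scheme).

    tracks: list over frames; each entry is a list of (x, y) anchor points
            belonging to one object in that frame.
    Returns a list of (cx, cy, scale), where (cx, cy) is the anchor centroid
    and scale is the RMS anchor spread relative to the first frame.
    """
    layout = []
    ref_spread = None
    for pts in tracks:
        n = len(pts)
        cx = sum(x for x, _ in pts) / n
        cy = sum(y for _, y in pts) / n
        # RMS distance of the anchors to their centroid, used as a size proxy.
        spread = math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in pts) / n)
        if ref_spread is None:
            ref_spread = spread
        # With a single anchor the spread is zero, so fall back to scale 1.0.
        scale = spread / ref_spread if ref_spread > 0 else 1.0
        layout.append((cx, cy, scale))
    return layout

def gaussian_mask(height, width, cx, cy, sigma):
    """Isotropic 2D Gaussian mask with peak value 1.0 at (cx, cy)."""
    return [
        [math.exp(-0.5 * (((x - cx) / sigma) ** 2 + ((y - cy) / sigma) ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def dynamic_masks(layout, height, width, base_sigma):
    """One Gaussian mask per frame; the width follows the per-frame scale."""
    return [gaussian_mask(height, width, cx, cy, base_sigma * s)
            for cx, cy, s in layout]

# Two frames of four anchors: the object moves and doubles in size.
tracks = [
    [(2, 2), (4, 2), (2, 4), (4, 4)],
    [(4, 4), (8, 4), (4, 8), (8, 8)],
]
layout = layout_from_tracks(tracks)
masks = dynamic_masks(layout, height=10, width=10, base_sigma=1.5)
```

In this toy setup each mask peaks at the object's tracked center and widens as the object grows, which is one simple way a soft spatial prior could follow a user-specified trajectory. A real system would apply such masks at latent resolution and would likely use anisotropic widths derived from the bounding box.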