HECTOR: Hybrid Editable Compositional Object References for Video Generation

Guofeng Zhang; Angtian Wang; Jacob Zhiyuan Fang; Liming Jiang; Haotian Yang; Alan Yuille; Chongyang Ma

TL;DR

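HECTOR is a generative pipeline for fine-grained compositional video generation, conditioned on static image and/or dynamic video references together with user-specified trajectories. It synthesizes coherent videos that satisfy complex spatiotemporal constraints while staying faithful to the references, and the authors report superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.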

Abstract

Real-world videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compositional control. In contrast to prior methods, HECTOR supports hybrid reference conditioning, allowing generation to be simultaneously guided by static images and/or dynamic videos. Moreover, users can explicitly specify the trajectory of each referenced element, precisely controlling its location, scale, and speed (see Figure 1). This design allows the model to synthesize coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references. Extensive experiments demonstrate that HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.

Paper Structure (40 sections, 7 equations, 5 figures, 2 tables)

This paper contains 40 sections, 7 equations, 5 figures, 2 tables.

Introduction
Related Works
Foundational video generation model.
Reference-based video customization.
Trajectory-controlled video generation.
Method
Preliminaries
Diffusion Transformers (DiTs).
Image-conditional video generation.
Trajectory-grounded motion modeling.
Video Decompositor
Video captioning.
Object identification and anchor points sampling.
Reference trajectory extraction.
HECTOR
...and 25 more sections

Figures (5)

Figure 1: We propose HECTOR, a compositional, reference-guided video generation architecture. HECTOR supports conditioning on heterogeneous reference inputs (static images and/or dynamic videos) while enabling precise control over each referenced element’s location, scale, and speed. Beyond that, HECTOR also accommodates diverse operations, including multi-object composition, camera-motion control (e.g., zoom-in/zoom-out), and reference-driven video editing such as object insertion and replacement, as shown above.
Figure 2: Pipeline of the Video Decompositor, which extracts video composition alongside dynamic and static references from a video. Specifically, Video SAM is first used to segment elements from the footage. Depending on the entity size, we place one or multiple anchor points on each object. A point tracking method is then used to propagate these selected anchors over time. We design a reference trajectory extraction method that converts the anchor tracks into a composition layout, capturing both the scale and translation of the entity. Finally, we crop each object from the original video using the computed spatial parameters to serve as the reference.
Figure 3: Overview of the HECTOR framework, which accepts hybrid inputs—static images and dynamic video references—alongside user-defined spatiotemporal layouts. The Spatio-Temporal Alignment Module (STAM) projects these references into the latent space using dynamic Gaussian masks to create aligned feature conditions. These conditions guide the DiT backbone to synthesize a unified video that preserves reference fidelity while strictly adhering to the specified motion trajectories.
Figure 4: Qualitative comparison against baselines. We evaluate static reference-controlled video generation, as baselines are limited to this modality. The left column displays the source reference objects; for a fair experimental setup, we apply masks to crop the objects, ensuring all approaches receive only the object appearance without background context. The right columns show the resulting generated videos, illustrating the visual quality and precise spatial alignment with the input bounding box trajectories.
Figure 5: Qualitative results for video references. We demonstrate our framework's versatility with video-based references through three distinct applications: (a) Object Replacement, seamlessly transferring a reference object's identity onto a moving subject, (b) Compositional Multi-Subject Generation, where distinct video references independently control separate entities, and (c) Background-Locked Motion Editing, enabling precise foreground manipulations while keeping the background region frozen.
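
The captions for Figures 2 and 3 describe two steps that a small example may make more concrete: converting tracked anchor points into a per-frame layout that captures translation and scale, and turning that layout into a dynamic Gaussian mask. The following is a minimal sketch under stated assumptions, not the authors' implementation. The names `LayoutBox`, `tracks_to_layout`, `gaussian_mask`, and `base_sigma` are hypothetical, and the choices of anchor centroid for translation, RMS anchor spread for scale, and an isotropic Gaussian for the mask are illustrative guesses; the page does not specify the actual formulas.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates


@dataclass
class LayoutBox:
    """Hypothetical per-frame layout entry: object centre and relative scale."""
    cx: float
    cy: float
    scale: float  # size relative to the first frame (1.0 = unchanged)


def _centroid(points: Sequence[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def _spread(points: Sequence[Point], centre: Point) -> float:
    # RMS distance of the anchors from their centroid (assumed size proxy).
    total = sum((p[0] - centre[0]) ** 2 + (p[1] - centre[1]) ** 2 for p in points)
    return math.sqrt(total / len(points))


def tracks_to_layout(tracks: Sequence[Sequence[Point]]) -> List[LayoutBox]:
    """Convert anchor tracks into a layout; tracks[t][k] is anchor k at frame t."""
    first_centre = _centroid(tracks[0])
    first_spread = _spread(tracks[0], first_centre)
    layout = []
    for points in tracks:
        centre = _centroid(points)
        spread = _spread(points, centre)
        # With a single anchor the spread is zero, so scale stays at 1.0.
        scale = spread / first_spread if first_spread > 0 else 1.0
        layout.append(LayoutBox(centre[0], centre[1], scale))
    return layout


def gaussian_mask(box: LayoutBox, height: int, width: int,
                  base_sigma: float) -> List[List[float]]:
    """Isotropic Gaussian centred on the box; its width follows the box scale."""
    sigma = base_sigma * box.scale
    return [
        [math.exp(-((x - box.cx) ** 2 + (y - box.cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


# Usage: four anchors on a square that moves and doubles in size.
tracks = [
    [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    [(1.0, 1.0), (5.0, 1.0), (5.0, 5.0), (1.0, 5.0)],
]
layout = tracks_to_layout(tracks)
mask = gaussian_mask(layout[1], height=7, width=7, base_sigma=1.0)
```

In this sketch the mask peaks at the object centre and widens as the object grows, which is one plausible way a per-frame layout could be turned into a spatially aligned condition.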
