
Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation

Binyuan Huang, Yuning Lu, Weinan Jia, Hualiang Wang, Mu Liu, Daiqing Yang

Abstract

Recent proprietary models such as Sora2 demonstrate promising progress in generating multi-shot videos conditioned on multiple reference characters, yet academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, in which semantically similar tokens degrade its ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which repurposes position encoding as an additional context-control signal beyond semantic retrieval. By attaching side information to tokens, PoCo enables precise token-level matching while preserving implicit semantic-consistency modeling. Building on PoCo, we develop a multi-reference, multi-shot video generation model that reliably controls characters with extremely similar visual traits. Extensive experiments demonstrate that PoCo improves cross-shot consistency and reference fidelity compared with various baselines.



Figures (6)

  • Figure 1: Comparison of different strategies for multi-reference, multi-shot video generation. (A) Independent single-shot reference-to-video generation produces each shot separately, leading to inconsistent backgrounds and appearance details across shots. (B) Joint multi-shot reference-to-video generation improves global coherence, but without explicit side information, the model may associate a shot with the wrong reference, causing identity confusion. (C) Our PoCo with SideInfo-RoPE enables accurate shot-reference association, yielding consistent identity and background across shots. Right: spatially averaged self-attention over concatenated reference and shot tokens. $\mathrm{AttnScore}(\mathrm{Shot}_i \rightarrow \mathrm{Ref}_j)$ denotes the mean attention from $\mathrm{Shot}_i$ to $\mathrm{Ref}_j$.
  • Figure 2: We propose a multi-reference, multi-shot video generation model conditioned on reference images and per-shot captions. (a) The overall architecture integrates reference images, shot captions, and latent video features through VAE and MultiShot-DiT blocks. Each block contains Hierarchical Cross-Attention (b) and Self-Attention with SideInfo-RoPE (c). (b) The hierarchical mask allows reference tokens to attend to all captions, while video tokens in each shot attend only to their corresponding text segment. (c) SideInfo-RoPE assigns reference-specific phase codes in the rotary embedding space, so that temporally aligned shots inherit the corresponding phase patterns. Colored planes denote active rotations, while gray planes denote unrotated ones. (Minimal sketches of the mask construction and the phase assignment follow this figure list.)
  • Figure 3: Data pipeline for multi-reference multi-shot video generation. The pipeline transforms raw long videos into multi-shot training samples. It includes video processing (quality filtering, shot segmentation, watermark removal, caption generation) and reference construction (face detection, ID clustering, background removal, and Seedream-enhanced reference synthesis). These steps ensure clean, consistent identities and high-quality supervision for video generation.
  • Figure 4: Qualitative effect of SideInfo-RoPE on shot-level identity grounding. We test two pairs of visually similar female and male characters using descriptions associated with two reference portraits (@character1, @character2). Without SideInfo-RoPE, the model often exhibits incorrect or ambiguous identity grounding, including both identity swaps and failure cases that do not clearly match either reference. With SideInfo-RoPE, the intended reference-shot correspondence is preserved more reliably. Colored boxes indicate grounding to the corresponding reference identity, while black boxes denote ambiguous or failed grounding.
  • Figure 5: Comparison with commercial reference-to-video methods under the same text prompts and identity references. Compared with Kling-1.6 and Vidu-Q2, PoCo achieves better cross-shot continuity and cross-view consistency, preserving identity, scene layout, lighting, and fine-grained appearance more faithfully.
  • ...and 1 more figure
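
To make the hierarchical mask of Figure 2(b) concrete, the following is a minimal PyTorch sketch. It assumes the per-shot captions are concatenated into one text sequence with known segment lengths; the function name, arguments, and token counts are illustrative, not the paper's implementation.

```python
import torch

def hierarchical_mask(num_ref_tokens: int,
                      shot_video_lens: list[int],
                      shot_caption_lens: list[int]) -> torch.Tensor:
    """Boolean attention mask [num_queries, num_caption_tokens], True = attend.
    Reference tokens attend to every caption token; video tokens of shot i
    attend only to caption segment i."""
    num_q = num_ref_tokens + sum(shot_video_lens)
    num_k = sum(shot_caption_lens)
    mask = torch.zeros(num_q, num_k, dtype=torch.bool)
    mask[:num_ref_tokens, :] = True                  # references see all captions
    q0, k0 = num_ref_tokens, 0
    for v_len, c_len in zip(shot_video_lens, shot_caption_lens):
        mask[q0:q0 + v_len, k0:k0 + c_len] = True    # shot i -> its own caption
        q0 += v_len
        k0 += c_len
    return mask

# Example: 16 reference tokens, 2 shots of 64 video tokens each,
# caption segments of 20 and 24 tokens.
m = hierarchical_mask(16, [64, 64], [20, 24])
print(m.shape)  # torch.Size([144, 44])
```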
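Similarly, here is a hedged sketch of one plausible reading of SideInfo-RoPE as described in the Figure 2(c) caption: each reference is assigned a binary phase code over the rotary planes, rotation is applied only on that reference's active planes, and the remaining planes stay unrotated (the gray planes in the figure). The code layout and all names below are assumptions for illustration, not the paper's exact scheme.

```python
import torch

def sideinfo_rope(x: torch.Tensor, pos: torch.Tensor, ref_id: torch.Tensor,
                  num_refs: int, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE only on the planes activated by a reference's phase code.
    x: [T, D] token features (D even); pos: [T] positions;
    ref_id: [T] index of the reference each token (or its shot) is tied to."""
    T, D = x.shape
    half = D // 2
    # Standard RoPE frequencies and per-position angles.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # [half]
    angle = pos[:, None].float() * freqs[None, :]                      # [T, half]
    # Hypothetical phase code: reference r activates the planes whose index
    # is congruent to r (mod num_refs); all other planes stay unrotated.
    codes = (torch.arange(half)[None, :] % num_refs
             ) == torch.arange(num_refs)[:, None]                      # [num_refs, half]
    angle = angle * codes[ref_id].float()   # zero angle = identity on gray planes
    cos, sin = angle.cos(), angle.sin()
    x1, x2 = x[:, :half], x[:, half:]       # each pair (x1_j, x2_j) is one plane
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Under this reading, a shot's tokens carry the same rotation pattern as their intended reference, which is the "temporally aligned shots inherit the corresponding phase patterns" property the caption describes.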