ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation

Mingyang Wu, Ashirbad Mishra, Soumik Dey, Shuo Xing, Naveen Ravipati, Hansi Wu, Binbin Li, Zhengzhong Tu

TL;DR

This work tackles the challenge of preserving fine-grained object identity in image-to-video generation under changing viewpoints. It introduces ConsIDVid, a scalable object-centric video dataset, and ConsIDVid-Bench, a multi-view identity-focused evaluation framework, to quantify geometric and appearance consistency across views. The authors then propose ConsID-Gen, a view-assisted generation approach with a dual-visual encoder and a text–visual connector that yields unified conditioning for a diffusion transformer backbone. Across proprietary and public subsets, ConsID-Gen demonstrates superior identity fidelity and geometric coherence compared to strong baselines, highlighting the value of explicit multi-view cues and cross-modal alignment for robust I2V. The work provides a data-driven benchmark, a novel model, and comprehensive evaluations that advance practical, identity-preserving I2V generation.

Abstract

Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, a novel benchmarking and evaluation framework for multi-view consistency that uses metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder and a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments on ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms baselines across multiple metrics, with its best overall performance surpassing leading video generation models such as Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at https://myangwu.github.io/ConsID-Gen.

Paper Structure (38 sections, 2 equations, 16 figures, 8 tables)

This paper contains 38 sections, 2 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Examples Synthesized by ConsID-Gen. Given a textual instruction and a reference image containing rigid objects (e.g., rings, diamonds), ConsID-Gen synthesizes realistic videos that faithfully preserve object identity and maintain geometric consistency. The first row was generated by Wan [wan2025wan] using the same prompt. Attributes highlighted in red denote object properties specified in the instruction.
  • Figure 2: Comparing Different Video Generation Paradigms. Single-stream (T2V) uses only text tokens as context. Dual-stream (I2V) concatenates text and 2D visual tokens with limited interaction. Hybrid representations (Ours) pre-align text and visual tokens via fine-grained interaction before projection.
  • Figure 3: Data Curation Pipeline. We curate and synthesize videos from diverse sources, followed by an automated data curation pipeline to ensure visual and temporal quality. Video captions are produced by Qwen2.5-VL via a hierarchical captioning strategy.
  • Figure 4: Statistics of video clips in ConsIDVid. The dataset includes diverse distributions of data source and video duration.
  • Figure 5: Overview of ConsID-Gen. The model takes as input the first frame, two uncalibrated images, and a text instruction. Our Dual-Visual Encoder combines a Visual Encoder and a Geometry Encoder to extract visual-appearance and geometric representations. A unified multimodal interaction projector then fuses these features with the prompt to generate conditioning tokens for the DiT backbone.
  • ...and 11 more figures
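
The conditioning flow described in the Figure 5 caption can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the function names (cross_attention, build_conditioning), the token counts, the feature sizes, and the weight-free single-head attention are all hypothetical, and random arrays stand in for the outputs of the Visual Encoder, the Geometry Encoder, and the text encoder. It only shows the general idea of letting text and visual tokens interact before they are projected into conditioning tokens for a DiT backbone.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Single-head scaled dot-product attention with no learned weights
    # (an illustrative simplification, not the paper's connector).
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (num_queries, num_context)
    return softmax(scores, axis=-1) @ context   # (num_queries, d)

def build_conditioning(appearance, geometry, text, w_proj):
    # Hypothetical fusion: stack appearance and geometry tokens from all views.
    visual = np.concatenate([appearance, geometry], axis=0)
    # Pre-align the two modalities by letting each attend to the other.
    text_aligned = text + cross_attention(text, visual)
    visual_aligned = visual + cross_attention(visual, text)
    # Project the fused sequence to the (assumed) backbone width.
    fused = np.concatenate([text_aligned, visual_aligned], axis=0)
    return fused @ w_proj

rng = np.random.default_rng(0)
# Assumed toy sizes: first frame + two auxiliary views, 4 tokens per view.
num_views, tokens_per_view, dim, model_dim = 3, 4, 8, 16
appearance = rng.normal(size=(num_views * tokens_per_view, dim))
geometry = rng.normal(size=(num_views * tokens_per_view, dim))
text = rng.normal(size=(5, dim))
w_proj = rng.normal(size=(dim, model_dim)) / np.sqrt(dim)

cond = build_conditioning(appearance, geometry, text, w_proj)
print(cond.shape)  # (29, 16): 5 text tokens + 24 visual tokens
```

In this sketch the three views are treated as an unordered bag of tokens, which loosely mirrors the caption's point that the auxiliary images are uncalibrated; how the actual model encodes or orders views is not specified in this excerpt.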