MagicView: Multi-View Consistent Identity Customization via Priors-Guided In-Context Learning

Hengjia Li, Jianjin Xu, Keli Cheng, Lei Wang, Ning Bi, Boxi Wu, Fernando De la Torre, Deng Cai

TL;DR

MagicView addresses the challenge of achieving multi-view identity-consistent customization from a single photograph by introducing a 3D priors-guided in-context learning framework for DiT-based models. The method uses in-context depth maps, rendered from an SMPL body mesh fitted to a PuLID sample, to activate multi-view reasoning, and employs a Semantic Correspondence Alignment loss to preserve semantic controllability under limited data. With only 100 training samples, MagicView achieves better multi-view consistency, identity fidelity, and prompt alignment than strong recent baselines, while remaining data-efficient and free of test-time tuning. The approach offers a practical pathway to high-quality, view-coherent personalized imagery and has potential extensions to 3D modeling and reconstruction. Overall, MagicView combines lightweight adaptation, 3D priors, and semantic-preserving finetuning to deliver robust, controllable multi-view identity customization.

Abstract

Recent advances in personalized generative models have demonstrated impressive capabilities in producing identity-consistent images of the same individual across diverse scenes. However, most existing methods lack explicit viewpoint control and fail to ensure multi-view consistency of generated identities. To address this limitation, we present MagicView, a lightweight adaptation framework that equips existing generative models with multi-view generation capability through 3D priors-guided in-context learning. While prior studies have shown that in-context learning preserves identity consistency across grid samples, its effectiveness in multi-view settings remains unexplored. Building upon this insight, we conduct an in-depth analysis of the multi-view in-context learning ability, and design a conditioning architecture that leverages 3D priors to activate this capability for multi-view consistent identity customization. On the other hand, acquiring robust multi-view capability typically requires large-scale multi-dimensional datasets, which makes incorporating multi-view contextual learning under limited data regimes prone to textual controllability degradation. To address this issue, we introduce a novel Semantic Correspondence Alignment loss, which effectively preserves semantic alignment while maintaining multi-view consistency. Extensive experiments demonstrate that MagicView substantially outperforms recent baselines in multi-view consistency, text alignment, identity similarity, and visual quality, achieving strong results with only 100 multi-view training samples.

Paper Structure

This paper contains 25 sections, 8 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: MagicView compared to conventional customization methods. Given one reference image, MagicView generates personalized images that are consistent across multiple views. Conventional methods like PuLID [guo2024pulid] offer limited control over the viewpoint specified in the prompt (i.e., left, middle, and right view) and lack multi-view consistency.
  • Figure 2: Overview of MagicView. In step 1, we use SMPL [goel2023humans] to fit the body mesh corresponding to the sample from the personalized generator [guo2024pulid]. We then render the body mesh to obtain multi-view depth maps. In step 2, with these in-context depth priors, we generate the multi-view customized images using the personalized model with our control adapter.
  • Figure 3: Overview of Semantic Correspondence Alignment Loss. Specifically, we minimize the L2 distance between semantic correspondences at each layer of the finetuned and pretrained MMDiT models for the same training sample, thereby explicitly constraining the finetuned model to retain the semantic control capabilities learned in pretraining.
  • Figure 4: Qualitative comparison. DiffPortrait3D and Era3D exhibit limitations in maintaining geometric and visual consistency, especially in full-body and background regions. Although ViewCrafter achieves improved scene modeling, it does so at the expense of geometric consistency in human representations. In addition, both BAGEL and Qwen-Image demonstrate suboptimal performance in terms of multi-view control. In contrast, our MagicView achieves superior performance in both geometric fidelity and visual coherence across views. The baseline methods generate results using prompts that explicitly specify viewpoints, such as “left view”, “middle view”, and “right view”.
  • Figure 5: Ablation study for multi-view in-context learning. As shown, transitioning from the pretrained model (second row) to the multi-view in-context learning model (third row) significantly improves cross-view geometric consistency. The addition of the SCA module (fourth row) helps the finetuned model retain the pretrained model's ability to follow prompt-based semantic controls, such as identity, clothing, and background.
  • ...and 1 more figure
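
The Figure 3 caption describes the Semantic Correspondence Alignment loss as an L2 distance between the semantic correspondences of the finetuned and pretrained MMDiT models, taken at each layer for the same training sample. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes each layer's correspondence is already available as a 2D matrix (how these matrices are extracted from MMDiT is not specified here), uses the squared L2 distance, and averages over layers; the names `squared_l2` and `sca_loss` are hypothetical. In a real training loop the pretrained values would presumably act as a fixed target.

```python
from typing import Sequence

# A correspondence map for one layer, represented as a 2D matrix of floats.
# ASSUMPTION: the real maps come from MMDiT layers; their extraction is not shown.
Matrix = Sequence[Sequence[float]]


def squared_l2(a: Matrix, b: Matrix) -> float:
    """Sum of squared element-wise differences between two equally shaped matrices."""
    if len(a) != len(b):
        raise ValueError("row count mismatch")
    total = 0.0
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            raise ValueError("column count mismatch")
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
    return total


def sca_loss(finetuned: Sequence[Matrix], pretrained: Sequence[Matrix]) -> float:
    """Mean over layers of the squared L2 distance between correspondence maps.

    ASSUMPTION: averaging over layers and squaring the distance are
    illustrative choices, not details confirmed by the source.
    """
    if not finetuned or len(finetuned) != len(pretrained):
        raise ValueError("both models must provide the same non-zero number of layers")
    per_layer = [squared_l2(f, p) for f, p in zip(finetuned, pretrained)]
    return sum(per_layer) / len(per_layer)


if __name__ == "__main__":
    # Two layers, each with a 1x2 correspondence map (toy values).
    finetuned_maps = [[[1.0, 0.0]], [[0.0, 2.0]]]
    pretrained_maps = [[[0.0, 0.0]], [[0.0, 0.0]]]
    print(sca_loss(finetuned_maps, pretrained_maps))
```

The loss is zero when the finetuned model reproduces the pretrained correspondences exactly and grows as they drift apart, which matches the caption's stated goal of constraining the finetuned model to retain the semantic control learned in pretraining.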