CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models

Minkyung Kwon, Jinhyeok Choi, Jiho Park, Seonghu Jeon, Jinhyuk Jang, Junyoung Seo, Minseop Kwak, Jin-Hwa Kim, Seungryong Kim

TL;DR

The paper addresses how to enforce view-consistency in multi-view diffusion models for novel view synthesis by revealing that cross-view geometric correspondence emerges in attention maps during training. It introduces CAMEO, a simple supervision technique that aligns cross-view attention with explicit geometric correspondences using an MLP head and a correspondence map, achieving substantially faster convergence and higher-quality syntheses. The method is demonstrated to be model-agnostic, improving performance across CAT3D, MVGenMaster, and a DiT-based model, and maintains geometry even under challenging viewpoints. These findings provide a practical pathway to incorporate geometry-aware supervision in diffusion-based NVS, with potential extensions to other multi-view and 4D tasks.

Abstract

Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view-consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise correspondences, thereby preserving the geometry and structure of reference images, accelerating convergence, and improving novel view synthesis performance. CAMEO reduces the number of training iterations required for convergence by half while achieving superior performance at the same iteration counts. We further demonstrate that CAMEO is model-agnostic and can be applied to any multi-view diffusion model.
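The idea of supervising an attention map with geometric correspondence can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the loss is taken to be a cross-entropy between each target-view query's attention distribution over reference-view tokens and a one-hot correspondence target, the function names are hypothetical, and the MLP head mentioned in the TL;DR is omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_view_attention(queries, keys):
    """Scaled dot-product attention from target-view queries to reference-view keys.

    queries: list of target-token feature vectors.
    keys:    list of reference-token feature vectors (same dimension).
    Returns one attention distribution over reference tokens per query.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]

def alignment_loss(attn, corr_index):
    """Mean cross-entropy against one-hot correspondence targets (assumed loss form).

    corr_index[i] is the reference-token index that geometrically matches
    target token i, or None when no valid correspondence exists (for example,
    an occluded point); such tokens are skipped.
    """
    losses = [
        -math.log(row[j] + 1e-12)
        for row, j in zip(attn, corr_index)
        if j is not None
    ]
    return sum(losses) / len(losses) if losses else 0.0

# Toy usage: two target tokens, two reference tokens.
attn = cross_view_attention([[4.0, 0.0], [0.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]])
loss_when_aligned = alignment_loss(attn, [0, 1])
loss_when_misaligned = alignment_loss(attn, [1, 0])
```

In this toy setup the loss is lower when attention already peaks at the corresponding reference token, so minimizing it would push the supervised layer toward the correspondence map. In a real model such a term would presumably be added to the diffusion training objective with a weighting factor, which this sketch leaves out.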

Paper Structure

This paper contains 33 sections, 8 equations, 19 figures, 9 tables.

Figures (19)

  • Figure 1: Correspondence-attention alignment makes multi-view diffusion training effective. Our framework, CAMEO, aligns the attention maps of multi-view diffusion models [cat3d, mvgenmaster, li2024hunyuandit] with geometric correspondence. In experiments, CAMEO produces geometrically consistent novel views even in challenging scenarios involving large viewpoint changes or complex geometry.
  • Figure 2: Attention maps in multi-view diffusion models and the geometric correspondence map. (a) Multi-view diffusion models and their 3D self-attention maps [cat3d, mvgenmaster]. (b) Attention vs. geometric correspondence map. The attention map of layer $l=10$ in CAT3D [cat3d] naturally focuses on its geometric counterpart across views even without explicit supervision.
  • Figure 3: Layer-wise behavior of the attention maps of a multi-view diffusion model (CAT3D [cat3d]). For each query point on the target image, the reference-image point that receives the model's maximum attention is marked with the same color as the query point. The attention map of layer $l=10$ clearly attends to the geometrically corresponding point, while other layers do not. We fix the timestep at $t=999$ (i.e., complete noise).
  • Figure 4: Effect of layer-wise attention perturbation. Following the perturbation procedure of PAG [pag], perturbing earlier layers barely changes generation quality, while perturbing layer 10 collapses geometric consistency and severely degrades quality.
  • Figure 5: Analysis of geometric correspondence in the attention maps of the multi-view diffusion model [cat3d]. (a) Correspondence precision across attention layers ($l = 2, 4, 6, 7, 10$), compared with other baselines [ldm, 5551153, wang2025vggt, simeoni2025dinov3]. (b) Correspondence precision of layer $l=10$ and the baselines across viewpoint rotation. (c) Correspondence precision of layer $l=10$ improves during training.
  • ...and 14 more figures
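
The argmax readout described for Figure 3 and the correspondence precision reported in Figure 5 can be sketched as follows. This is an illustrative assumption rather than the paper's exact metric: here a query counts as correct when its maximally attended reference token lies within a distance threshold (in token units) of the ground-truth corresponding token, and all function names, the grid layout, and the threshold are hypothetical.

```python
import math

def argmax_match(attn_row):
    """Index of the reference token that receives the most attention."""
    return max(range(len(attn_row)), key=attn_row.__getitem__)

def token_to_xy(index, width):
    """Convert a flattened token index to (x, y) on a row-major grid."""
    return (index % width, index // width)

def correspondence_precision(attn, gt_index, width, threshold=1.0):
    """Fraction of queries whose argmax attention lands near the true match.

    attn:      one attention distribution over reference tokens per query.
    gt_index:  ground-truth reference-token index per query, or None when the
               query has no valid correspondence (skipped).
    width:     width of the reference token grid (assumed row-major).
    threshold: maximum allowed distance in token units (assumed value).
    """
    hits = 0
    total = 0
    for row, gt in zip(attn, gt_index):
        if gt is None:
            continue
        px, py = token_to_xy(argmax_match(row), width)
        gx, gy = token_to_xy(gt, width)
        total += 1
        if math.hypot(px - gx, py - gy) <= threshold:
            hits += 1
    return hits / total if total else 0.0

# Toy usage: a 2x2 reference grid and two target queries.
toy_attn = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
strict = correspondence_precision(toy_attn, [0, 0], width=2, threshold=0.0)
loose = correspondence_precision(toy_attn, [0, 0], width=2, threshold=1.5)
```

Evaluating such a score per layer, per viewpoint rotation, or per training checkpoint would give curves of the kind the Figure 5 caption describes, under the assumptions stated above.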