Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

Min-Seop Kwak, Junho Kim, Sangdoo Yun, Dongyoon Han, Taekyung Kim, Seungryong Kim, Jin-Hwa Kim

TL;DR

The paper introduces a diffusion-based framework for pose-free, few-shot novel view synthesis (NVS) that treats unseen-view generation as warping and inpainting of partial geometry predicted from unposed reference images. Its key component is cross-modal attention instillation (MoAI), which couples the image and geometry outputs; proximity-based mesh conditioning further strengthens the geometric conditioning, so the model produces aligned images and 3D geometry even in extrapolative settings. The approach aggregates attention across multiple reference views to guide both the image and geometry diffusion branches, and it is evaluated on several datasets, including Co3D, RealEstate10K, and DTU. The result is high-fidelity novel views and aligned colored point clouds for 3D completion, without requiring calibrated poses or per-scene optimization.

Abstract

We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention instillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between the points of the point cloud and filtering out erroneously predicted geometry so that it does not influence the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. Project page is available at https://cvlab-kaist.github.io/MoAI.
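
The warping half of the warping-and-inpainting formulation can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `warp_point_cloud`, the pinhole intrinsics `K`, and the world-to-camera matrix `w2c` are assumptions made for illustration. The sketch projects a colored point cloud into a target view with a z-buffer and returns a mask of covered pixels; the uncovered pixels are the regions an inpainting model would have to fill.

```python
import numpy as np

def warp_point_cloud(points, colors, K, w2c, height, width):
    """Project a colored point cloud into a target view (hypothetical helper).

    points: (N, 3) world coordinates; colors: (N, 3) float colors
    K: (3, 3) pinhole intrinsics; w2c: (4, 4) world-to-camera transform
    Returns (image, depth, mask); mask is False where inpainting is needed.
    """
    # Move the points into the target camera frame.
    cam = points @ w2c[:3, :3].T + w2c[:3, 3]
    z = cam[:, 2]

    # Discard points behind the camera.
    in_front = z > 1e-6
    cam, z, colors = cam[in_front], z[in_front], colors[in_front]

    # Perspective projection to integer pixel coordinates.
    uvw = cam @ K.T
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)

    # Discard points that fall outside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    # Z-buffer: keep the smallest depth that lands on each pixel.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)

    # Only the nearest point per pixel writes its color
    # (exact depth ties are resolved arbitrarily).
    image = np.zeros((height, width, 3), dtype=np.float64)
    winners = z == depth[v, u]
    image[v[winners], u[winners]] = colors[winners]

    mask = np.isfinite(depth)
    return image, depth, mask
```

In this sketch each point covers a single pixel, so the warped image is sparse; the abstract's proximity-based mesh conditioning addresses that sparsity in a way this example does not attempt to reproduce.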

Paper Structure

This paper contains 31 sections, 6 equations, 18 figures, 8 tables.

Figures (18)

  • Figure 1: Overview of our diffusion-based framework. From one or more unposed reference images, we predict a partial colored point cloud and project it to the target view. Our diffusion model then inpaints missing regions with the cross-Modal Attention Instillation (MoAI), ensuring alignment between image and geometry, resulting in a complete 3D scene.
  • Figure 2: Training methodology. Our method conducts cross-modal attention instillation, replacing the spatial attention maps of geometry denoising networks with those of image denoising networks, so that the image generation U-Net learns a more robust representation aligned with the geometry completion task. On the other hand, the geometry prediction networks leverage the rich semantics from image features to enhance geometry completion capability.
  • Figure 3: Effects of cross-modal attention instillation.
  • Figure 4: Qualitative results. We demonstrate our qualitative results on the Co3D [reizenstein2021common] dataset, conducting NVS while generating aligned geometry robustly and consistently.
  • Figure 5: Qualitative comparison with inpainting method on DTU [zhou2018stereo] dataset. Our qualitative comparison with the naive warping-and-inpainting method demonstrates our model's zero-shot generalization capabilities to unseen data, as well as its ability to robustly handle erroneous warped geometries for geometrically consistent generation.
  • ...and 13 more figures
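
The attention replacement described in the Figure 2 caption can be sketched roughly as follows. This is a single-head NumPy toy, not the paper's U-Net code: the function names and the choice of scaled dot-product attention are assumptions made for illustration. The attention map is computed only from the image branch's queries and keys, and the geometry branch reuses that map with its own values, so both outputs share the same spatial correspondences.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(q, k):
    """Scaled dot-product attention weights, shape (num_queries, num_keys)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def instilled_attention(img_q, img_k, img_v, geo_v):
    """Hypothetical single-head sketch of attention instillation.

    The map comes from the image branch only; the geometry branch
    applies the same map to its own value features.
    """
    attn = attention_map(img_q, img_k)
    img_out = attn @ img_v   # image branch output
    geo_out = attn @ geo_v   # geometry branch reuses the image attention map
    return img_out, geo_out, attn
```

Because the geometry branch never computes its own attention weights in this sketch, any token that the image branch attends to is attended to identically on the geometry side, which is the alignment property the caption describes.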