Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering

Ido Sobol, Chenfeng Xu, Or Litany

TL;DR

This work proposes Zero-to-Hero, a novel test-time approach that enhances view synthesis by manipulating attention maps during the denoising process of Zero-1-to-3, and implements a filtering mechanism that aggregates attention maps, enhancing generation reliability and authenticity.

Abstract

Generating realistic images from arbitrary views based on a single source image remains a significant challenge in computer vision, with broad applications ranging from e-commerce to immersive virtual experiences. Recent diffusion models, particularly Zero-1-to-3, have been widely adopted for generating plausible views, videos, and 3D models. However, these models still produce inconsistent and implausible novel views, especially under challenging viewpoint changes. In this work, we propose Zero-to-Hero, a novel test-time approach that enhances view synthesis by manipulating attention maps during the denoising process of Zero-1-to-3. By drawing an analogy between the denoising process and stochastic gradient descent (SGD), we implement a filtering mechanism that aggregates attention maps, enhancing generation reliability and authenticity. This process improves geometric consistency without requiring retraining or significant computational resources. Additionally, we modify the self-attention mechanism to integrate information from the source view, reducing shape distortions. These processes are further supported by a specialized sampling schedule. Experimental results demonstrate substantial improvements in fidelity and consistency, validated on a diverse set of out-of-distribution objects. We further demonstrate the general applicability and effectiveness of Zero-to-Hero in multi-view generation and in image generation conditioned on semantic maps and pose.
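
The filtering idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the maps are aggregated, so the code below assumes two hypothetical choices: a plain mean over the attention maps from repeated passes at the same denoising step, followed by an exponential moving average with the previous step's filtered map. The names `filter_attention` and `momentum`, the noise model, and the toy shapes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def filter_attention(same_step_maps, prev_filtered=None, momentum=0.5):
    """Aggregate noisy attention maps (hypothetical scheme, not the paper's exact filter).

    same_step_maps: array of shape (R, N, N) holding row-stochastic attention
        maps from R passes at one denoising step.
    prev_filtered: optional (N, N) filtered map carried over from the previous step.
    momentum: assumed weight given to the previous step's filtered map.
    """
    # Aggregate within the current step.
    current = np.mean(same_step_maps, axis=0)
    if prev_filtered is None:
        return current
    # Aggregate across steps with an exponential moving average.
    return momentum * prev_filtered + (1.0 - momentum) * current


# Toy demonstration: noisy maps around shared logits, filtered over 3 steps.
rng = np.random.default_rng(0)
R, N = 4, 6
base_logits = rng.normal(size=(N, N))

filtered = None
for step in range(3):
    noisy_maps = softmax(base_logits + rng.normal(scale=1.0, size=(R, N, N)))
    filtered = filter_attention(noisy_maps, filtered)

# A convex combination of row-stochastic maps is still row-stochastic.
rows_sum_to_one = np.allclose(filtered.sum(axis=-1), 1.0)
```

Because every input map is row-stochastic and both aggregation stages are convex combinations, the filtered map remains a valid attention distribution under these assumptions.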

Paper Structure

This paper contains 30 sections, 1 equation, 16 figures, 10 tables, 1 algorithm.

Figures (16)

  • Figure 1: Novel views generated from a single source image (far left column) at a specific target view angle (with different seeds), comparing Zero123-XL (liu2023zero1to3) with our Zero-to-Hero method. Operating during inference, our method achieves significantly higher fidelity and maintains authenticity to the original image, all while ensuring realistic variation in the results (e.g., variations in chair backs in the top row). The ground-truth target view is displayed in the far right column.
  • Figure 2: Zero-to-Hero main modules. (Left) Two denoising steps of the generation process for both the source view (top) and the target view (bottom). Each denoising step is iterated $R$ times ("resampling"). (Right-top) Attention map filtering: making attention maps more robust by aggregating maps from the same step and from previous steps. (Right-bottom) Mutual self-attention: guiding the target shape through the keys and values of the source generation branch.
  • Figure 3: Cross-attention in Zero-1-to-3. (Left) The cross-attention map before applying softmax. (Right) The degenerate all-ones attention map produced by applying softmax to the left map.
  • Figure 4: By injecting ground-truth attention maps extracted from the target view, we demonstrate that self-attention maps are key to robust view synthesis.
  • Figure 5: From SGD to Diffusion Models: An illustration of our conceptual analogy.
  • ...and 11 more figures
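
The degenerate map described in the Figure 3 caption can be reproduced in a few lines. The sketch assumes that the cross-attention conditioning consists of a single token, so the softmax normalizes over exactly one key; the caption itself does not state this, and the shapes and variable names below are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
num_queries, dim = 16, 8

# Image-feature queries and a single conditioning token (assumption).
queries = rng.normal(size=(num_queries, dim))
cond_token = rng.normal(size=(1, dim))

# Pre-softmax cross-attention logits: one value per query, shape (16, 1).
logits = queries @ cond_token.T / np.sqrt(dim)

# Softmax over the key axis, which has length 1 here.
attn = softmax(logits, axis=-1)
```

The logits differ from query to query, but normalizing over a single key forces every attention weight to 1, so the post-softmax map carries no spatial information in this setting.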