Focus on Neighbors and Know the Whole: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation

Bonan Li, Zicheng Zhang, Xingyi Yang, Xinchao Wang

TL;DR

This work addresses the challenge of producing dense, consistent multiview images from text prompts for 3D creation. It introduces CoSER, a hybrid architecture that combines neighbor-focused contextual attention with spiral bidirectional scan-based state-space modeling to enforce both short- and long-range cross-view coherence while maintaining efficiency. Key contributions include Appearance Awareness and Detail Refinement for adjacent views, Spiral Mamba and Accumulated Inconsistency Rectification for global consistency and selective downsampling, and extensive ablations showing their impact. The approach yields higher-quality, semantically aligned multiview outputs than state-of-the-art methods and can be integrated into existing 3D reconstruction pipelines to accelerate dense view synthesis.

Abstract

Generating dense multiview images from text prompts is crucial for creating high-fidelity 3D assets. Nevertheless, existing methods struggle with space-view correspondences, resulting in sparse and low-quality outputs. In this paper, we introduce CoSER, a novel consistent dense Multiview Text-to-Image Generator for Text-to-3D, which achieves both efficiency and quality by meticulously learning neighbor-view coherence and further alleviating ambiguity through a swift traversal of all views. To achieve neighbor-view consistency, each viewpoint densely interacts with adjacent viewpoints to perceive the global spatial structure, and aggregates information along motion paths explicitly defined by physical principles to refine details. To further enhance cross-view consistency and alleviate content drift, CoSER rapidly scans all views in a spiral bidirectional manner to capture holistic information and then scores each point based on semantic material. Subsequently, we conduct weighted down-sampling along the spatial dimension based on these scores, thereby facilitating the fusion of prominent information across all views with lightweight computation. Technically, the core module is built by integrating the attention mechanism with a selective state space model, exploiting the robust learning capabilities of the former and the low overhead of the latter. Extensive evaluation shows that CoSER is capable of producing dense, high-fidelity, content-consistent multiview images that can be flexibly integrated into various 3D generation models.
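The scan-then-downsample idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define the exact scan path or scoring function, so the sketch assumes a clockwise, outside-in spiral over a 2D grid of positions, treats "bidirectional" as traversing that order forward and backward, and assumes the weighted down-sampling is a softmax-weighted average over consecutive windows of points. The names `spiral_order` and `weighted_downsample` are hypothetical.

```python
import math


def spiral_order(h, w):
    """Return (row, col) indices of an h x w grid in clockwise, outside-in
    spiral order. The spiral layout itself is an assumption of this sketch."""
    top, bottom, left, right = 0, h - 1, 0, w - 1
    order = []
    while top <= bottom and left <= right:
        order += [(top, c) for c in range(left, right + 1)]
        order += [(r, right) for r in range(top + 1, bottom + 1)]
        if top < bottom:
            order += [(bottom, c) for c in range(right - 1, left - 1, -1)]
        if left < right:
            order += [(r, left) for r in range(bottom - 1, top, -1)]
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order


def weighted_downsample(features, scores, window):
    """Pool each run of `window` consecutive points into one vector, weighting
    every point by a softmax over the scores inside its window (assumed form)."""
    pooled = []
    for start in range(0, len(features), window):
        chunk_f = features[start:start + window]
        chunk_s = scores[start:start + window]
        peak = max(chunk_s)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in chunk_s]
        total = sum(exps)
        dim = len(chunk_f[0])
        pooled.append(
            [sum(e / total * f[d] for e, f in zip(exps, chunk_f)) for d in range(dim)]
        )
    return pooled


# Forward and backward passes over the same (assumed) spiral path.
forward = spiral_order(3, 3)
backward = forward[::-1]

# Four 1-D point features with placeholder scores, pooled two at a time.
feats = [[0.0], [2.0], [4.0], [6.0]]
pooled_uniform = weighted_downsample(feats, [0.0, 0.0, 0.0, 0.0], window=2)
pooled_peaked = weighted_downsample(feats, [100.0, 0.0, 0.0, 100.0], window=2)
```

With equal scores the pooling reduces to a plain average of each window, while a dominant score makes the pooled vector follow the highest-scoring point. In the paper, the scores would come from the learned model rather than from fixed placeholder values.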

Paper Structure

This paper contains 13 sections, 6 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: CoSER aims to generate detailed and diverse 3D objects from text prompts. All prompts are from T3Bench, and the template is set to "A DSLR photo of ..., 3d asset". Best viewed in color at 2x zoom.
  • Figure 2: Illustration of CoSER. Given images rendered from 12 views at the same elevation, we take a pre-trained text-to-image generation model and fine-tune it by incorporating camera poses and lifting the 2D UNet to generate multi-view images. Specifically, we achieve 3D perception by employing fine-grained learning for neighboring viewpoints and coarse-grained interactions across all viewpoints. For neighbors, Appearance Awareness (AA) learns the basic appearance and Detail Refinement (DR) ensures point-level consistency between neighbors. Across all views, we quickly scan all viewpoints with Rapid Glance (RG) and then eliminate ambiguity with Accumulated Inconsistency Rectification (AIR).
  • Figure 3: Qualitative comparison of VideoMV (top) and our CoSER (bottom).
  • Figure 4: Qualitative comparisons of GaussianDreamer, Hash3D, and our CoSER.
  • Figure 5: Ablation of the proposed modules. The arrows indicate the changes made to Appearance Awareness by sequentially adding Detail Refinement, Rapid Glance, and Accumulated Inconsistency Rectification. Best viewed in color at 2x zoom.
  • ...and 2 more figures
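
The Figure 2 caption mentions 12 views rendered at the same elevation, with separate handling for neighboring viewpoints. A minimal sketch of such a camera ring follows. It assumes the views are evenly spaced in azimuth, and it uses placeholder values for the elevation and radius; none of these are stated in the caption. The helpers `ring_cameras` and `neighbors` are hypothetical names.

```python
import math


def ring_cameras(num_views=12, elevation_deg=15.0, radius=2.0):
    """Camera centers on a ring at a fixed elevation around the origin.
    Even azimuth spacing, elevation, and radius are assumptions of this sketch."""
    el = math.radians(elevation_deg)
    cams = []
    for i in range(num_views):
        az = 2.0 * math.pi * i / num_views
        cams.append((
            radius * math.cos(el) * math.cos(az),
            radius * math.cos(el) * math.sin(az),
            radius * math.sin(el),
        ))
    return cams


def neighbors(i, num_views=12):
    """Indices of the two adjacent views on the ring, wrapping around."""
    return ((i - 1) % num_views, (i + 1) % num_views)


cams = ring_cameras()
left, right = neighbors(0)  # view 0 is adjacent to views 11 and 1
```

Because the ring is closed, the first and last views are neighbors of each other, which is why the neighbor lookup wraps around with a modulo.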