MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention

Yuhan Wang, Fangzhou Hong, Shuai Yang, Liming Jiang, Wayne Wu, Chen Change Loy

TL;DR

This work tackles high-resolution multiview human generation with MEAT, a diffusion model that uses a central clothed human mesh to establish cross-view correspondences through rasterization and projection. Its mesh attention blocks fuse features across 16 views at 1024×1024 in a memory-efficient way, avoiding the prohibitive cost of conventional multiview attention. Key contributions include the mesh attention design, keypoint conditioning, resolution upscaling with SDXL-VAE, and a training pipeline that adapts the DNA-Rendering multiview video dataset for diffusion training. Experiments show that, at megapixel resolution, MEAT produces denser views, finer texture detail, and better cross-view consistency than existing multiview diffusion methods, a step toward practical, high-fidelity human novel-view synthesis.

Abstract

Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called mesh attention to enable training at 1024x1024 resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention leverages rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT. In addition, we present valuable insights into applying multiview human motion videos for diffusion training, addressing the longstanding issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods.
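To give a rough sense of why restricting attention to mesh correspondences reduces cost, the sketch below counts query-key pairs for dense multiview attention and for a correspondence-based scheme. It is a back-of-the-envelope illustration, not the paper's complexity analysis: the function names, the 128×128 feature map, and the assumption that each query gathers `d` samples from each of the `N` views are all hypothetical.

```python
def dense_multiview_pairs(num_views: int, height: int, width: int) -> int:
    """Query-key pairs when every token attends to every token in all views."""
    tokens = num_views * height * width
    return tokens * tokens


def mesh_attention_pairs(num_views: int, height: int, width: int, d: int = 4) -> int:
    """Query-key pairs when each query reads only d samples per view.

    Assumption: every pixel acts as a query (an upper bound, since pixels
    that do not overlap the mesh would have no correspondences).
    """
    queries = num_views * height * width
    keys_per_query = num_views * d
    return queries * keys_per_query


if __name__ == "__main__":
    n, h, w = 16, 128, 128  # hypothetical: 16 views, 128x128 feature map
    dense = dense_multiview_pairs(n, h, w)
    sparse = mesh_attention_pairs(n, h, w)
    print(dense)            # 68719476736
    print(sparse)           # 16777216
    print(dense // sparse)  # 4096
```

Under these assumptions the number of pairs grows linearly rather than quadratically with the number of pixels, which is the kind of saving that makes higher resolutions tractable.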

Paper Structure

This paper contains 23 sections, 15 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Given a frontal human image, MEAT can generate dense, view-consistent multiview images at a resolution of $1024^2$.
  • Figure 2: VAE and Resolution. Each row represents the same version of VAE, while each column corresponds to the same resolution of the full-body image after VAE reconstruction. Although the full-body image rendered at $512\times512$ shows good visual quality, it falls short when used in diffusion models with VAE. We find that a resolution of $1024\times1024$ is necessary for optimal results.
  • Figure 3: Mesh Attention Block. (a) $P_p$ aggregation. When the resolution of the feature map is very low, the ray cast from the center of a pixel may not intersect with the mesh, although the pixel area itself overlaps with it. (b) Projection. Each projected point is rounded to four integer pixels, corresponding to $d=4$ in Table \ref{tab:multiview_attn}. The projected points on the reference view are also used to retrieve the encoded VAE features. (c) MEAT block pipeline. We use mesh attention to fuse U-Net features from all $N$ views, and VAE features from the reference. An additional per-view self-attention block is applied to process the captured multiview features. $M$ stands for masked skip connection.
  • Figure 4: Pipeline of MEAT. We insert mesh attention blocks into up-sampling blocks of the U-Net to fuse multiview features.
  • Figure 5: Qualitative Results. MEAT (Ours) demonstrates significant advantages in resolution, detail, and cross-view consistency in novel view synthesis tasks. * Methods are re-trained on the DNA-Rendering dataset for fair comparison. Please zoom in for details.
  • ...and 4 more figures
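
The projection step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, rotation `R`, and translation `t`, and the helper names `project_points` and `four_neighbors` are hypothetical. The bilinear weights are an added assumption; the caption only states that each projected point is rounded to four integer pixels.

```python
import numpy as np


def project_points(points_world, K, R, t):
    """Project (M, 3) world-space points to (M, 2) pixel coordinates."""
    cam = points_world @ R.T + t     # world frame -> camera frame
    uvw = cam @ K.T                  # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide


def four_neighbors(uv, height, width):
    """Return the four integer pixels around each projected point.

    Outputs have shape (M, 4): column indices, row indices, and
    bilinear weights (assumed here, not taken from the paper).
    """
    x0 = np.floor(uv[:, 0]).astype(int)
    y0 = np.floor(uv[:, 1]).astype(int)
    dx = uv[:, 0] - x0
    dy = uv[:, 1] - y0
    xs = np.stack([x0, x0 + 1, x0, x0 + 1], axis=1)
    ys = np.stack([y0, y0, y0 + 1, y0 + 1], axis=1)
    weights = np.stack(
        [(1 - dx) * (1 - dy), dx * (1 - dy), (1 - dx) * dy, dx * dy], axis=1
    )
    # Clamp to the image so border points still yield valid indices.
    xs = np.clip(xs, 0, width - 1)
    ys = np.clip(ys, 0, height - 1)
    return xs, ys, weights


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
    R, t = np.eye(3), np.zeros(3)
    surface_points = np.array([[0.125, -0.0625, 2.0]])  # e.g. a rasterized mesh point
    uv = project_points(surface_points, K, R, t)
    xs, ys, weights = four_neighbors(uv, height=128, width=128)
    print(uv, xs, ys, weights)
```

In a mesh attention setting, the same 3D surface point would be projected into each of the $N$ views with that view's camera, and the four returned pixels per view would index the feature map to gather keys and values.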