BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Gwanghyun Kim, Hayeon Kim, Hoigi Seo, Dong Un Kang, Se Young Chun

TL;DR

BeyondScene tackles the challenge of generating high‑resolution human‑centric scenes with diffusion models, addressing limitations in training image size and token budgets. It introduces a two‑stage framework: Detailed Base Image Generation to create per‑instance content beyond token limits, and Instance‑Aware Hierarchical Enlargement to upscale beyond training resolution. Key innovations include High Frequency‑Injected Forward Diffusion guided by Canny edges to inject high‑frequency details, and Adaptive Joint Diffusion with adaptive conditioning and adaptive stride to maintain pose accuracy and reduce duplication. Empirical results on CrowdCaption show superior text‑image correspondence and naturalness up to $8192\times8192$, demonstrating the method's ability to push beyond pretrained diffusion models without costly retraining.
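
The adaptive stride idea can be illustrated with a minimal one-dimensional sketch. The summary states only that the stride adapts to the presence of instances; the specific rule below (a fine stride for windows that touch an instance, a coarse stride over pure background) and every name in it (`adaptive_window_starts`, `has_instance`, `fine_stride`, `coarse_stride`) are assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def adaptive_window_starts(
    has_instance: List[bool],
    window: int,
    fine_stride: int,
    coarse_stride: int,
) -> List[int]:
    """Return the left edges of sliding windows along one axis.

    has_instance[i] is True when position i (for example, a latent column)
    is covered by a human instance. Windows that touch an instance advance
    by fine_stride; windows over pure background advance by coarse_stride.
    This rule is a hypothetical stand-in for the paper's adaptive stride.
    """
    length = len(has_instance)
    if window > length:
        raise ValueError("window must not exceed the axis length")
    if not 0 < fine_stride <= coarse_stride <= window:
        raise ValueError("expected 0 < fine_stride <= coarse_stride <= window")

    last = length - window  # last valid start position
    starts: List[int] = []
    pos = 0
    while True:
        starts.append(pos)
        if pos == last:
            break
        touches_instance = any(has_instance[pos:pos + window])
        step = fine_stride if touches_instance else coarse_stride
        pos = min(pos + step, last)  # clamp so the final window ends at the border
    return starts


# Example: a 32-wide axis with one instance covering positions 12..19.
mask = [12 <= i < 20 for i in range(32)]
starts = adaptive_window_starts(mask, window=8, fine_stride=2, coarse_stride=6)
```

Because the stride never exceeds the window size, consecutive windows always overlap or abut, so every position is covered. The denser windows around the instance mean more overlapping views are fused there, which is the intuition behind using a smaller stride where pose accuracy matters.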

Abstract

Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, limited text encoder capacity (token limits), and the inherent difficulty of generating complex scenes involving multiple humans. Current methods address only the training size limit and often yield human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes these limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene takes a staged, hierarchical approach. It first generates a detailed base image, focusing on the elements crucial to creating instances of multiple humans and on detailed descriptions that exceed the token limit of the diffusion model. It then seamlessly converts the base image into a higher-resolution output that exceeds the training image size and incorporates text-aware and instance-aware details, via our novel instance-aware hierarchical enlargement process, which consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in terms of correspondence with detailed text descriptions and naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models without costly retraining. Project page: https://janeyeon.github.io/beyond-scene.
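
To make the phrase "high-frequency injected forward diffusion" more concrete, here is a toy one-dimensional sketch in pure Python. It rests on several assumptions that are not taken from the paper: a normalized central difference stands in for a Canny edge map, the perturbation replaces a sample with a random nearby sample with probability equal to the edge strength, and the noising step uses the standard DDPM closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. All function names are hypothetical.

```python
import math
import random
from typing import List


def edge_strength(signal: List[float]) -> List[float]:
    """Normalized absolute central difference (a stand-in for a Canny map)."""
    n = len(signal)
    grad = [abs(signal[min(i + 1, n - 1)] - signal[max(i - 1, 0)]) for i in range(n)]
    peak = max(grad)
    return [g / peak if peak > 0 else 0.0 for g in grad]


def perturb_near_edges(
    signal: List[float], prob: List[float], max_shift: int, rng: random.Random
) -> List[float]:
    """Replace sample i by a nearby sample with probability prob[i]."""
    n = len(signal)
    out = list(signal)
    for i in range(n):
        if rng.random() < prob[i]:
            j = i + rng.randint(-max_shift, max_shift)
            out[i] = signal[min(max(j, 0), n - 1)]  # clamp to valid indices
    return out


def forward_diffuse(
    x0: List[float], alpha_bar: float, rng: random.Random
) -> List[float]:
    """DDPM closed form: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * rng.gauss(0.0, 1.0) for x in x0]


rng = random.Random(0)
upsampled = [0.0] * 8 + [1.0] * 8             # a single step edge
prob = edge_strength(upsampled)               # nonzero only next to the edge
perturbed = perturb_near_edges(upsampled, prob, max_shift=2, rng=rng)
noisy = forward_diffuse(perturbed, alpha_bar=0.5, rng=rng)
```

In this toy, flat regions have zero edge strength and pass through the perturbation unchanged, so only samples next to the step are jittered before noise is added. That is the sense in which the perturbation is "adaptive" here; the actual perturbation schedule and edge processing used by BeyondScene may differ.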

Paper Structure

This paper contains 46 sections, 18 figures, 6 tables, 4 algorithms.

Figures (18)

  • Figure 1: BeyondScene pushes the boundaries of high-resolution human-centric scene generation. Unlike existing methods that often suffer from unrealistic scenes, anatomical distortions, and limited text-to-image correspondence, BeyondScene excels in 1) highly detailed scenes, 2) natural and diverse humans, 3) fine-grained control. This breakthrough paves the way for groundbreaking applications in human-centric scene design. Each description is shown in the same color as the instance it describes in the pose map.
  • Figure 2: Beyond 8K ultra-high resolution image. This 8192$\times$8192 image, generated by BeyondScene, surpasses the training resolution of SDXL by 64$\times$, while exceeding the technical classification of 8K (7680$\times$4320).
  • Figure 3: BeyondScene generates high-resolution images in two stages. First, individual instances are created using pose-guided T2I diffusion models, segmented, cropped, and placed onto an inpainted background. The tone of the image is then normalized. In the second stage (illustrated in Fig. 4), this base image is progressively enlarged while maintaining detail and quality, refining the image and adding further details by leveraging high-frequency injected forward diffusion and adaptive joint diffusion (AJ).
  • Figure 4: Our instance-aware hierarchical enlargement involves two crucial processes: 1) High frequency-injected forward diffusion, which makes it possible to reach high resolution through joint diffusion by applying adaptive pixel perturbation. 2) Adaptive joint diffusion, which dynamically regulates the stride and the pose and text conditioning based on the presence of instances.
  • Figure 5: Qualitative comparison for generating high-resolution human scenes (3584$\times$2048). Existing approaches such as T2I-Direct (SDXL [podell2023sdxl]), T2I-Large (MultiDiffusion [bar2023multidiffusion], SyncDiffusion [lee2023syncdiffusion], and ScaleCrafter [he2023scalecrafter]), and Visual+T2I (ControlNet [podell2023sdxl, controlnet23], T2IAdapter [podell2023sdxl, t2i23], and R-MultiDiffusion [bar2023multidiffusion]) models struggle with artifacts, whereas our method produces images with minimal artifacts, strong text-image correspondence, and a natural look. Each description is shown in the same color as the instance it describes in the pose map.
  • ...and 13 more figures
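
The resolution claims in the Figure 2 caption can be checked with a few lines of arithmetic. The sketch below assumes SDXL's nominal training resolution of 1024×1024, which is consistent with the 64× figure in the caption but is not stated explicitly in this summary; the other two sizes come directly from the caption.

```python
# Pixel-count arithmetic behind the Figure 2 caption.
SDXL_TRAIN = (1024, 1024)    # assumption: SDXL's nominal training resolution
BEYOND_SCENE = (8192, 8192)  # output size stated in the caption
UHD_8K = (7680, 4320)        # 8K size stated in the caption


def pixels(size):
    """Total number of pixels for a (width, height) pair."""
    width, height = size
    return width * height


ratio_vs_sdxl = pixels(BEYOND_SCENE) / pixels(SDXL_TRAIN)
ratio_vs_8k = pixels(BEYOND_SCENE) / pixels(UHD_8K)

print(f"{pixels(BEYOND_SCENE):,} px, {ratio_vs_sdxl:.0f}x SDXL, {ratio_vs_8k:.2f}x 8K")
```

Under that assumption, the 64× figure refers to total pixel count (8× per side), and the 8192×8192 output contains roughly twice as many pixels as a 7680×4320 frame.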