BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion
Gwanghyun Kim, Hayeon Kim, Hoigi Seo, Dong Un Kang, Se Young Chun
TL;DR
BeyondScene tackles the challenge of generating high‑resolution human‑centric scenes with diffusion models, addressing limitations in training image size and token budgets. It introduces a two‑stage framework: Detailed Base Image Generation to create per‑instance content beyond token limits, and Instance‑Aware Hierarchical Enlargement to upscale beyond training resolution. Key innovations include High Frequency‑Injected Forward Diffusion guided by Canny edges to inject high‑frequency details, and Adaptive Joint Diffusion with adaptive conditioning and adaptive stride to maintain pose accuracy and reduce duplication. Empirical results on CrowdCaption show superior text‑image correspondence and naturalness up to $8192\times8192$, demonstrating the method's ability to push beyond pretrained diffusion models without costly retraining.
Abstract
Generating higher-resolution human-centric scenes with fine detail and control remains a challenge for existing text-to-image diffusion models. The challenge stems from limited training image size, limited text encoder capacity (a bounded number of tokens), and the inherent difficulty of generating complex scenes involving multiple humans. Current methods have attempted to address only the training size limit, and they often yield human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes these limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene employs a staged, hierarchical approach. It first generates a detailed base image, focusing on the elements crucial to creating instances of multiple humans and on detailed descriptions that exceed the token limit of the diffusion model. It then seamlessly converts the base image into a higher-resolution output that exceeds the training image size and incorporates text-aware and instance-aware details, via our novel instance-aware hierarchical enlargement process, which consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in correspondence with detailed text descriptions and in naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models, without costly retraining. Project page: https://janeyeon.github.io/beyond-scene.
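
To make the adaptive-stride idea behind adaptive joint diffusion more concrete, the following is a minimal one-dimensional sketch, not the authors' implementation. It assumes that joint diffusion denoises overlapping windows over a large canvas, and that windows overlapping a human instance advance with a smaller stride than background windows. The function names, the parameters `stride_back` and `stride_inst`, and the numbers in the usage example are all hypothetical and chosen only for illustration.

```python
def overlaps(start, window, spans):
    """Return True if the window [start, start + window) intersects any span."""
    end = start + window
    return any(start < hi and lo < end for lo, hi in spans)


def adaptive_window_starts(length, window, stride_back, stride_inst, inst_spans):
    """Start offsets of sliding windows along one axis of the canvas.

    Assumption (hypothetical, for illustration): a window that overlaps a
    human-instance span advances by the smaller `stride_inst`; any other
    window advances by the larger `stride_back`.
    """
    if window >= length:
        return [0]
    last = length - window  # final window is aligned to the canvas edge
    starts = []
    pos = 0
    while pos < last:
        starts.append(pos)
        near_instance = overlaps(pos, window, inst_spans)
        pos += stride_inst if near_instance else stride_back
    starts.append(last)
    return starts


# Hypothetical usage: a 256-unit axis, 64-unit windows, one instance at [100, 140).
uniform = adaptive_window_starts(256, 64, 32, 16, inst_spans=[])
adaptive = adaptive_window_starts(256, 64, 32, 16, inst_spans=[(100, 140)])
print(len(uniform), len(adaptive))
```

In this sketch the windows are sampled more densely around the instance, so that region is covered by more overlapping views, while the background keeps the cheaper, larger stride.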
