ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies

Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma, Xuehai Zhang, Xiao Liu, Zhaopeng Cui, Yuewen Ma

Abstract

Automating immersive VR scene creation remains a primary research challenge. Existing methods typically rely on complex geometry with post-simplification, resulting in inefficient pipelines or limited realism. In this paper, we introduce ImmerseGen, a novel agent-guided framework for compact and photorealistic world generation that decouples realism from exhaustive geometric modeling. ImmerseGen represents scenes as hierarchical compositions of lightweight geometric proxies with synthesized RGBA textures, facilitating real-time rendering on mobile VR headsets. We propose terrain-conditioned texturing for base world generation, combined with context-aware texturing for scenery, to produce diverse and visually coherent worlds. VLM-based agents employ semantic grid-based analysis for precise asset placement and enrich scenes with multimodal enhancements such as visual dynamics and ambient sound. Experiments and real-time VR applications demonstrate that ImmerseGen achieves superior photorealism, spatial coherence, and rendering efficiency compared to existing methods.

Paper Structure

This paper contains 39 sections, 4 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Asset comparison from different sources. We compare assets created by learning-based generative methods (blue labels), by artists (green labels), and by our method. Our generative RGBA-textured proxy assets achieve finer visual detail than existing generative models [clay, xiang2024structured] while using fewer triangles, delivering a photorealistic appearance comparable to artist-created high-poly or baked assets.
  • Figure 2: Overview. Given a user’s text input, the agent first retrieves a base terrain. Conditioned on the terrain depth and an extended prompt, panoramic textures for the terrain and sky are generated to form a layered base world. Next, VLM-based asset agents enrich the scene by selecting asset proxies as foreground or midground scenery, designing detailed asset prompts, and determining optimal asset placement. Each asset is instantiated via RGBA texture synthesis. Finally, the agent incorporates dynamic visual effects and synthesized ambient sound, producing a lightweight and photorealistic world.
  • Figure 3: Workflow of base world generation. Panoramic textures for terrain mesh and sky are generated for the base world. To tame the diffusion model for terrain texturing, we propose geometric adaptation (b) for depth control and user-centric texture mapping (c).
  • Figure 4: The proposed context-aware texture synthesis (a) generates diverse, contextually coherent RGBA textures directly on lightweight proxies for both foreground and midground scenery (b).
  • Figure 5: The proposed semantic grid-based analysis overlays a labeled grid and masks unsuitable regions as visual prompts. This enables the VLM agent to progressively select grid cells in a coarse-to-fine manner, improving the accuracy and semantic coherence of asset arrangement.
  • ...and 9 more figures