SEGAR: Selective Enhancement for Generative Augmented Reality

Fanjun Bu, Chenyang Yuan, Hiroshi Yasuda

Abstract

Generative world models offer a compelling foundation for augmented-reality (AR) applications: by predicting future image sequences that incorporate deliberate visual edits, they allow temporally coherent, augmented future frames to be computed ahead of time and cached, avoiding per-frame rendering from scratch in real time. In this work, we present SEGAR, a preliminary framework that combines a diffusion-based world model with a selective correction stage to support this vision. The world model generates augmented future frames with region-specific edits while leaving other regions unchanged, and the correction stage subsequently aligns safety-critical regions with real-world observations while preserving intended augmentations elsewhere. We demonstrate this pipeline in driving scenarios, a representative setting where semantic region structure is well defined and real-world feedback is readily available. We view this as an early step toward generative world models as practical AR infrastructure, where future frames can be generated, cached, and selectively corrected on demand.
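
To make the serving pattern described above concrete, the sketch below shows one way the generate-cache-correct loop could be organized. It is a minimal illustration under stated assumptions, not SEGAR's implementation: the `world_model`, `corrector`, `camera`, and `display` interfaces are hypothetical stand-ins, while the three-frame conditioning window and nine-frame horizon follow Figure 1.

```python
# Minimal sketch of the generate-cache-correct loop (hypothetical interfaces).
from collections import deque

NUM_CONDITION = 3   # condition frames, t in [1, 3] (Figure 1)
HORIZON = 9         # predicted future frames, t in [4, 12] (Figure 1)

def serve(world_model, corrector, camera, display, edit_prompt):
    history = deque(maxlen=NUM_CONDITION)
    cache = deque()  # augmented future frames, computed ahead of time

    while True:
        frame = camera.read()          # latest real-world observation
        history.append(frame)

        # Stage I: once enough context exists, predict a batch of augmented
        # future frames and cache them, so nothing is generated from scratch
        # per frame at display time.
        if len(history) == NUM_CONDITION and not cache:
            cache.extend(world_model.generate_augmented(
                list(history), prompt=edit_prompt, horizon=HORIZON))

        # Stage II: before display, align safety-critical regions of the
        # cached prediction with the current observation while keeping the
        # augmented appearance elsewhere.
        if cache:
            display(corrector.correct(cache.popleft(), real=frame))
```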

Paper Structure

This paper contains 18 sections, 3 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: SEGAR system pipeline overview. In Stage I, we train a Vista-based generative stylizer to take three condition frames ($t \in [1,3]$) and output future frames with desired augmented edits ($t \in [4,12]$). In Stage II, the generative stylizer finetuned with LoRA takes the augmented future frame and the corresponding real-world observation as inputs, and outputs a corrected frame in which safety-critical regions are aligned with reality while augmented edits are preserved. In this example, a car that appears in frames 10--12 is absent from the initial prediction because it was not visible in the three condition frames. The corrected output faithfully restores the car while retaining the augmented appearance of non-critical regions.
  • Figure 2: Given an input image sequence, we compute inpainting regions using semantic segmentation. The resulting masks guide VACE's inpainting process to augment static scene elements into a Tokyo-style appearance (a minimal mask-construction sketch follows this list).
  • Figure 3: Example training target for Stage I. The three frames in the top row are condition frames, whose clean latents are injected into the denoising process via dynamic prior injection [gao2024vista]. The remaining nine future frames are generated using VACE with the desired visual augmentations.
  • Figure 4: (Left) Real-world image. (Middle) Semantic segmentation of safety-critical regions. (Right) A buffer zone, shown in black, is introduced by dilating the safety-critical mask; loss is not computed in the buffer zone (see the loss-masking sketch after this list).
  • Figure 5: Visualization of latent representations of the real observation before (top) and after (bottom) masking, with masked regions inpainted in gray. Spatial structure is preserved in the latent space (right), validating the use of mask downsampling for region-specific loss computation.
  • ...and 4 more figures
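
As a minimal sketch of the mask-construction step in Figure 2, the snippet below derives a binary inpainting mask from a per-pixel semantic segmentation map. The class ids and helper names are hypothetical, and the actual class set used to steer VACE may differ.

```python
# Sketch of Figure 2's mask construction: allow inpainting only on static
# scene elements. Class ids below are illustrative, not the paper's setup.
import numpy as np

STATIC_CLASS_IDS = [2, 3, 4]  # e.g. building, wall, fence (hypothetical ids)

def inpainting_mask(seg_map: np.ndarray) -> np.ndarray:
    """seg_map: (H, W) per-pixel class ids -> uint8 mask that is 1 where
    augmentation (inpainting) is permitted."""
    return np.isin(seg_map, STATIC_CLASS_IDS).astype(np.uint8)

def masks_for_sequence(seg_maps):
    # One mask per frame; these masks would then condition VACE's inpainting.
    return [inpainting_mask(m) for m in seg_maps]
```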
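
The buffer zone of Figure 4 and the mask downsampling of Figure 5 can likewise be sketched with standard tensor operations. The snippet below is an assumption-laden illustration: the dilation kernel size, the latent stride of 8, and the squared-error form of the loss are placeholders, not the paper's reported settings.

```python
# Sketch of Figure 4's buffer zone and Figure 5's latent-space loss masking.
import torch
import torch.nn.functional as F

def loss_masks(critical: torch.Tensor, dilation: int = 7):
    """critical: (1, 1, H, W) binary mask of safety-critical pixels.
    Returns (critical, preserve, buffer) masks; no loss is computed in the
    buffer zone separating the two loss regions (Figure 4)."""
    pad = dilation // 2
    # Max-pooling a binary mask with stride 1 is a morphological dilation.
    dilated = F.max_pool2d(critical.float(), kernel_size=dilation,
                           stride=1, padding=pad)
    buffer = dilated - critical.float()   # ring around the critical region
    preserve = 1.0 - dilated              # regions keeping the augmentation
    return critical.float(), preserve, buffer

def to_latent_resolution(mask: torch.Tensor, stride: int = 8) -> torch.Tensor:
    # Figure 5 suggests spatial structure survives VAE encoding, so a plain
    # downsample of the pixel-space mask can gate a latent-space loss.
    return F.max_pool2d(mask, kernel_size=stride)

def region_loss(pred_latent, target_latent, pixel_mask, stride: int = 8):
    """Squared error restricted to the region given by a pixel-space mask."""
    m = to_latent_resolution(pixel_mask, stride)   # (1, 1, h, w)
    diff = (pred_latent - target_latent) ** 2      # (1, C, h, w)
    return (diff * m).sum() / (m.sum() * diff.shape[1] + 1e-8)
```

Under this reading of the captions, Stage II training would apply `region_loss` once with the critical mask against the real observation and once with the preserve mask against the augmented target, skipping the buffer zone entirely.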