Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering

Shanlin Sun, Yifan Wang, Hanwen Zhang, Yifeng Xiong, Qin Ren, Ruogu Fang, Xiaohui Xie, Chenyu You

TL;DR

Ouroboros addresses the cycle inconsistency and inefficiency of sequential forward and inverse diffusion rendering. It introduces two single-step diffusion models for RGB→X (inverse rendering) and X→RGB (forward rendering) that are trained with cycle-consistency losses and end-to-end fine-tuning, enabling fast, coherent bidirectional rendering. The approach leverages heterogeneous indoor/outdoor datasets (Hypersim, InteriorVerse, MatrixCity), uses channel dropout, and extends to training-free video inference with temporal patching and pseudo-3D kernels. It achieves state-of-the-art or competitive results on both inverse and forward rendering tasks, delivering up to a 50× speedup over prior diffusion methods and enabling temporally consistent video decomposition without video-specific training.

Abstract

While multi-step diffusion models have advanced both forward and inverse rendering, existing approaches often treat these problems independently, leading to cycle inconsistency and slow inference speed. In this work, we present Ouroboros, a framework composed of two single-step diffusion models that handle forward and inverse rendering with mutual reinforcement. Our approach extends intrinsic decomposition to both indoor and outdoor scenes and introduces a cycle consistency mechanism that ensures coherence between forward and inverse rendering outputs. Experimental results demonstrate state-of-the-art performance across diverse scenes while achieving substantially faster inference speed compared to other diffusion-based methods. We also demonstrate that Ouroboros can transfer to video decomposition in a training-free manner, reducing temporal inconsistency in video sequences while maintaining high-quality per-frame inverse rendering.
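
The cycle consistency idea from the abstract can be illustrated with a small sketch. The exact loss used in the paper is not given here, so the code below assumes a generic L1 cycle loss in both directions; `cycle_loss`, `toy_inverse`, `toy_forward`, and `biased_forward` are hypothetical names for illustration, and the toy functions stand in for the two single-step diffusion models.

```python
from typing import Callable, List

Vector = List[float]
Model = Callable[[Vector], Vector]


def l1(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_loss(image: Vector, intrinsics: Vector,
               inverse: Model, forward: Model) -> float:
    """Assumed L1 cycle loss (not necessarily the paper's formulation).

    image cycle:     RGB -> X -> RGB should reproduce the image
    intrinsic cycle: X -> RGB -> X should reproduce the intrinsic maps
    """
    image_cycle = l1(forward(inverse(image)), image)
    intrinsic_cycle = l1(inverse(forward(intrinsics)), intrinsics)
    return image_cycle + intrinsic_cycle


# Toy stand-ins for the two models (purely illustrative).
def toy_inverse(rgb: Vector) -> Vector:
    return [0.5 * v for v in rgb]


def toy_forward(x: Vector) -> Vector:
    return [2.0 * v for v in x]


def biased_forward(x: Vector) -> Vector:
    # A forward model that is not the exact inverse of toy_inverse.
    return [2.0 * v + 0.5 for v in x]


if __name__ == "__main__":
    image, intrinsics = [0.25, 0.5], [0.125, 0.375]
    print(cycle_loss(image, intrinsics, toy_inverse, toy_forward))
    print(cycle_loss(image, intrinsics, toy_inverse, biased_forward))
```

When the two models are exact inverses of each other the loss is zero; any mismatch between them shows up in both cycle terms, which is the signal a cycle-consistent training scheme would penalize.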

Paper Structure

This paper contains 32 sections, 9 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Single-step Diffusion Models for Forward and Inverse Rendering in Cycle Consistency. Upper left: Ouroboros decomposes input images into intrinsic maps (albedo, normal, roughness, metallicity, and irradiance). Given these generated intrinsic maps and textual prompts, our neural forward rendering model synthesizes images closely matching the originals. Upper right: We extend an end-to-end finetuning technique [martingarcia2024diffusione2eft] to diffusion-based neural rendering, outperforming state-of-the-art RGB$\leftrightarrow$X [zeng2024rgb] in both speed and accuracy. The radar plot illustrates numerical comparisons on the InteriorVerse dataset [zhu2022learning]. Bottom: Our method achieves temporally consistent video inverse rendering without specific finetuning on video data.
  • Figure 2: Overview of Ouroboros Pipeline. (a) shows the training pipeline of our single-step diffusion-based inverse and forward rendering models. For inverse rendering, the model takes the image $I$ and a text prompt indicating the output intrinsic map as input to finetune the latent diffusion UNet. For forward rendering, the model is fed the concatenated intrinsic maps along with a simple image description to estimate the original image. (b) gives an overview of the cycle training pipeline.
  • Figure 3: Iterative Video Generation Pipeline. Overlapping windows are processed sequentially, with latent representations from previous windows guiding the initialization of overlapping regions. In practice, the window size and overlap are larger than shown in the figure.
  • Figure 4: Comprehensive Visual Comparison between Baseline Models and our Ouroboros on Diverse Inverse Rendering Tasks.
  • Figure 5: Examples of Video Inference. Our model demonstrates the ability to process real-world scenarios.
  • ...and 4 more figures
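
The overlapping-window scheme described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`sliding_windows`, `run_windows`, `toy_denoise`) are hypothetical, latents are reduced to one float per frame, and "guiding the initialization" is approximated by simply reusing the latent already computed for an overlapping frame.

```python
from typing import Callable, Dict, List, Optional, Tuple

Init = List[Optional[float]]
Denoiser = Callable[[List[int], Init], List[float]]


def sliding_windows(num_frames: int, window: int,
                    overlap: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges covering the video."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if num_frames <= 0:
        return []
    stride = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += stride
    return windows


def run_windows(num_frames: int, window: int, overlap: int,
                denoise: Denoiser) -> List[float]:
    """Process windows in order, seeding overlapping frames with the
    latents produced by the previous window (assumed behavior)."""
    latents: Dict[int, float] = {}
    for start, end in sliding_windows(num_frames, window, overlap):
        frames = list(range(start, end))
        # Frames seen in an earlier window carry their latent forward;
        # frames seen for the first time have no initialization (None).
        init = [latents.get(t) for t in frames]
        for t, z in zip(frames, denoise(frames, init)):
            latents[t] = z
    return [latents[t] for t in range(num_frames)]


def toy_denoise(frames: List[int], init: Init) -> List[float]:
    # Stand-in model: keep seeded latents, otherwise emit the frame index.
    return [z if z is not None else float(t) for t, z in zip(frames, init)]


if __name__ == "__main__":
    print(sliding_windows(10, 4, 2))
    print(run_windows(5, 4, 2, toy_denoise))
```

With a window of 4 frames and an overlap of 2, each new window shares its first two frames with the previous one, so only the remaining frames start without a carried-over latent.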