RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination

Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong

TL;DR

RenderFormer tackles fast, high-fidelity rendering with full global illumination and without per-scene training. It introduces a two-stage transformer pipeline: a view-independent stage models triangle-to-triangle light transport, and a view-dependent stage maps ray-bundle tokens to pixel radiance. Triangle positions enter through a relative 3D positional encoding. Trained end-to-end on synthetic triangle meshes, it generalizes across scenes and lighting, and trades accuracy against compute relative to traditional path tracing. The approach offers a differentiable, scene-agnostic alternative to classical rendering and NeRF-like methods, with potential extensions to inverse rendering and to larger, more complex scenes.

Abstract

We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two-stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values, guided by the triangle sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.
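
The two-stage data flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the token counts, the feature width, the single attention head, the random inputs, and the function names (`attention`, `two_stage_sketch`) are all assumptions made for illustration. Learned projections, the relative positional encoding, and the final decoding to pixel patches are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Single-head scaled dot-product attention (no learned projections).
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T * scale, axis=-1)
    return weights @ values

def two_stage_sketch(triangle_tokens, ray_bundle_tokens):
    # Stage 1 (view-independent): triangle tokens attend to each other,
    # standing in for triangle-to-triangle light transport.
    lit_triangles = triangle_tokens + attention(
        triangle_tokens, triangle_tokens, triangle_tokens
    )
    # Stage 2 (view-dependent): ray-bundle tokens query the triangle
    # tokens produced by stage 1 through cross-attention.
    return ray_bundle_tokens + attention(
        ray_bundle_tokens, lit_triangles, lit_triangles
    )

# Hypothetical sizes: 16 triangles, 4 ray bundles, 8 features per token.
rng = np.random.default_rng(0)
triangle_tokens = rng.normal(size=(16, 8))
ray_bundle_tokens = rng.normal(size=(4, 8))

output_tokens = two_stage_sketch(triangle_tokens, ray_bundle_tokens)
print(output_tokens.shape)
```

The point of the sketch is the dependency structure: the first stage never sees the camera, so in principle its output can be reused across views, while the second stage produces one output token per ray bundle.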

Paper Structure

This paper contains 29 sections, 15 figures, 2 tables.

Figures (15)

  • Figure 1: RenderFormer Architecture Overview. Top: the view-independent stage resolves triangle-to-triangle light transport from a sequence of triangle tokens that encode the reflectance properties of each triangle. The relative position of each triangle is separately encoded, and applied to each token at each self-attention layer. Bottom: the view-dependent stage takes as input the virtual camera position encoded as a sequence of ray-bundles. Guided by the resulting triangle tokens from the view-independent stage via a cross-attention layer, the ray-bundle tokens are transformed to tokens encoding the outgoing radiance per view ray. Finally, the ray-bundle tokens are transformed to log-encoded HDR radiance value through an additional dense vision transformer.
  • Figure 2: The four template scenes used for generating training data.
  • Figure 3: A variety of scenes rendered with RenderFormer and compared to path-traced reference images. We also list the PSNR, SSIM, LPIPS, and FLIP errors.
  • Figure 4: Equal-time comparison between RenderFormer and Blender Cycles with $26$ samples per pixel (non-adaptive) and without denoising.
  • Figure 5: Qualitative comparison of varying #view-independent + #view-dependent attention layers per stage. RenderFormer is shown in the last column with a ratio of $12$ view-independent versus $6$ view-dependent layers.
  • ...and 10 more figures
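
Two quantities named in the captions above, log-encoded HDR radiance (Figure 1) and PSNR (Figure 3), can be sketched as follows. The captions do not give the exact encoding, so the `log(1 + x)` form used here is an assumption, as are the function names and the `max_value` convention for PSNR.

```python
import numpy as np

def log_encode(radiance):
    # Assumed encoding: log(1 + x) compresses the HDR range and maps 0 to 0.
    return np.log1p(radiance)

def log_decode(encoded):
    # Inverse of log_encode.
    return np.expm1(encoded)

def psnr(prediction, reference, max_value=1.0):
    # Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE).
    mse = np.mean((prediction - reference) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

# Round trip on a few hypothetical radiance values.
hdr = np.array([0.0, 0.5, 4.0, 100.0])
roundtrip = log_decode(log_encode(hdr))

# PSNR between a constant prediction and a constant reference.
example_psnr = psnr(np.zeros(4), np.full(4, 0.1))
print(roundtrip, example_psnr)
```

A logarithmic encoding keeps very bright values, such as light sources, from dominating the numerical range the network has to predict, and it is exactly invertible, so linear radiance can be recovered before computing image metrics.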