Cicero: Addressing Algorithmic and Architectural Bottlenecks in Neural Rendering by Radiance Warping and Memory Optimizations

Yu Feng, Zihan Liu, Jingwen Leng, Minyi Guo, Yuhao Zhu

TL;DR

Cicero addresses both the algorithmic and the architectural bottlenecks of neural rendering. It combines sparse radiance warping (SpaRW), memory-centric DRAM access, and a bank-conflict-free on-chip data layout in a co-designed SoC. SpaRW reuses radiances from nearby frames to eliminate up to 88% of NeRF MLP computations with minimal PSNR loss. The memory optimizations convert irregular DRAM accesses into a streaming pattern and remove SRAM bank conflicts through a channel-major data layout. An augmented Gathering Unit and a memory-centric pipeline enable fully streaming NeRF rendering at modest area overhead. Together they achieve up to 28.2× speed-up and 37.8× energy savings over a mobile baseline equipped with a dedicated DNN accelerator. Across synthetic and real-world datasets, Cicero delivers strong performance gains for both local and remote VR/AR rendering with acceptable quality, making real-time neural rendering on mobile platforms more practical.
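The radiance-reuse idea behind SpaRW can be illustrated with ordinary depth-based forward warping. The Python sketch below is a minimal illustration under stated assumptions, not Cicero's algorithm. It assumes a pinhole camera (focal length `f`, principal point `cx`, `cy`), per-pixel reference depth, and a target camera that differs from the reference only by a translation `t`. The helper name `warp_reference_to_target` is hypothetical. The sketch back-projects each reference pixel, reprojects it into the target view, and resolves collisions with a z-buffer. Pixels that nothing lands on (`holes`) are the only ones that would still need full NeRF evaluation.

```python
import math

def warp_reference_to_target(ref_radiance, ref_depth, width, height, f, cx, cy, t):
    """Forward-warp per-pixel radiance from a reference view into a target view.

    Hypothetical helper. Assumes a pinhole camera (focal length f, principal
    point (cx, cy)) and a target camera that differs from the reference only by
    a translation t = (tx, ty, tz). Pixels are keyed by (u, v).
    """
    tx, ty, tz = t
    target = {}  # (u, v) -> radiance reused in the target view
    zbuf = {}    # (u, v) -> depth of the nearest surface written so far
    for (u, v), z in ref_depth.items():
        # Back-project the reference pixel to a 3D point (reference camera space).
        x = (u - cx) * z / f
        y = (v - cy) * z / f
        # Move the point into the target camera frame (translation only).
        xt, yt, zt = x - tx, y - ty, z - tz
        if zt <= 0:
            continue  # behind the target camera
        # Project and snap to the nearest target pixel.
        ut = math.floor(f * xt / zt + cx + 0.5)
        vt = math.floor(f * yt / zt + cy + 0.5)
        if not (0 <= ut < width and 0 <= vt < height):
            continue  # lands outside the target frame
        # Z-buffer: when several reference pixels collide, keep the nearest one.
        if zt < zbuf.get((ut, vt), math.inf):
            zbuf[(ut, vt)] = zt
            target[(ut, vt)] = ref_radiance[(u, v)]
    # Uncovered pixels (disocclusions, frame borders) would still need NeRF.
    holes = sorted((u, v) for v in range(height) for u in range(width)
                   if (u, v) not in target)
    return target, holes

# Toy example: 4x4 frame, constant depth 2, camera shifted 1 unit along +x.
W, H = 4, 4
ref_rad = {(u, v): 10 * v + u for v in range(H) for u in range(W)}
ref_z = {(u, v): 2.0 for v in range(H) for u in range(W)}
warped, holes = warp_reference_to_target(ref_rad, ref_z, W, H,
                                         f=2.0, cx=1.5, cy=1.5, t=(1.0, 0.0, 0.0))
reused = len(warped) / (W * H)
print(f"reused {reused:.0%} of pixels; NeRF needed for {holes}")
```

In this toy case the sideways shift leaves one column uncovered. As a result, 12 of the 16 pixels are reused and only 4 would go through the MLP. A practical warper would also handle rotation and reject unreliable reprojections; the sketch omits both.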

Abstract

Neural Radiance Field (NeRF) is widely seen as an alternative to traditional physically-based rendering. However, NeRF has not yet been adopted in resource-limited mobile systems such as Virtual and Augmented Reality (VR/AR) because it is extremely slow: on a mobile Volta GPU, even state-of-the-art NeRF models generally run at only 0.8 FPS. We show that the main performance bottlenecks are both algorithmic and architectural, and we introduce CICERO to tame both forms of inefficiency. We first introduce two algorithms: one fundamentally reduces the amount of work any NeRF model has to execute, and the other eliminates irregular DRAM accesses. We then describe an on-chip data layout strategy that eliminates SRAM bank conflicts. A pure software implementation of CICERO offers an 8.0x speed-up and 7.9x energy saving over a mobile Volta GPU. When compared to a baseline with a dedicated DNN accelerator, our speed-up and energy reduction increase to 28.2x and 37.8x, respectively. All of these gains come with minimal quality loss (less than 1.0 dB reduction in peak signal-to-noise ratio).
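A toy cycle model of banked SRAM shows why the on-chip layout matters. The sketch below is hypothetical and does not reproduce Cicero's memory organization. It assumes 8 single-ported banks that each return one word per cycle, 8 feature channels per grid vertex, and a 16×16×16 vertex grid. The names `NUM_BANKS`, `bank_vertex_interleaved`, `bank_channel_major`, and `gather_cycles` are illustrative. The model compares two layouts:

- A layout that keeps each vertex's whole feature vector in one bank, with vertices interleaved across banks.
- A channel-major layout in which each bank holds one channel for all vertices.

For each layout, it counts the cycles needed to gather every channel of a voxel's eight corners.

```python
from collections import Counter

# Hypothetical parameters for a toy model (not Cicero's actual configuration).
NUM_BANKS = 8      # single-ported SRAM banks, one word per bank per cycle
NUM_CHANNELS = 8   # feature channels stored per grid vertex
GRID_RES = 16      # vertices per side of the feature grid

def voxel_corners(x, y, z, res=GRID_RES):
    """Linear indices of the 8 vertices of the voxel whose min corner is (x, y, z)."""
    return [(x + dx) + (y + dy) * res + (z + dz) * res * res
            for dz in (0, 1) for dy in (0, 1) for dx in (0, 1)]

def bank_vertex_interleaved(vertex, channel):
    """Whole feature vector of a vertex in one bank; vertices round-robin over banks."""
    return vertex % NUM_BANKS

def bank_channel_major(vertex, channel):
    """Each bank holds one channel plane covering every vertex."""
    return channel % NUM_BANKS

def gather_cycles(vertices, bank_of):
    """Cycles to read all channels of all given vertices.

    Requests to the same bank serialize, so the busiest bank sets the count.
    """
    load = Counter(bank_of(v, c) for v in vertices for c in range(NUM_CHANNELS))
    return max(load.values())

corners = voxel_corners(5, 5, 5)
print("vertex-interleaved:", gather_cycles(corners, bank_vertex_interleaved), "cycles")
print("channel-major:     ", gather_cycles(corners, bank_channel_major), "cycles")
```

Because the grid side is a multiple of the bank count, the eight corners fall into only two banks under the vertex-interleaved layout, and the gather takes 32 cycles. The channel-major layout spreads the 64 words evenly and needs 8 cycles, the minimum for 8 banks, no matter which voxel a sample hits.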

Paper Structure (63 sections, 6 equations, 26 figures)

Figures (26)

  • Figure 1: The rendering pipeline of today's NeRF algorithms, with the computation flow highlighted in purple. Each ray first samples points ($S_{1}$, $S_{2}$, and $S_{3}$) along the ray direction. Each ray sample gathers and interpolates 3D features from the eight vertices of its intersected voxel ($V_3$, $V_{32}$, and $V_{81}$). The interpolated features ($F_1$, $F_2$, and $F_3$) are then fed into the MLP to obtain the partial pixel values at the three ray samples. The final pixel value is accumulated from all partial pixel values [mildenhall2021nerf].
  • Figure 2: Frame rate vs. model size on the Xavier SoC [xaviersoc]. Models [muller2022instant, sun2022direct, chen2022tensorf, hu2022efficientnerf, chen2023mobilenerf, hedman2021baking] are named by reference numbers.
  • Figure 3: Normalized execution breakdown across state-of-the-art NeRF algorithms [muller2022instant, chen2022tensorf, hu2022efficientnerf, sun2022direct].
  • Figure 4: Percentage of non-continuous DRAM accesses in feature gathering.
  • Figure 5: Cache miss rate in feature gathering across common NeRF algorithms.
  • ...and 21 more figures
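The per-ray computation in Figure 1 can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's code. `toy_decoder` is a hypothetical stand-in for the NeRF MLP that reads density and color directly from a 4-channel feature, the feature values are made up, and all three samples fall in the same voxel for brevity. Trilinear interpolation weights the eight corner features. The partial pixel values are then accumulated with the standard volume-rendering rule of Mildenhall et al. [mildenhall2021nerf], $C = \sum_i T_i (1 - e^{-\sigma_i \delta_i}) c_i$ with $T_i = \prod_{j<i} e^{-\sigma_j \delta_j}$.

```python
import math

def trilinear(corner_feats, fx, fy, fz):
    """Trilinearly interpolate features from the 8 corners of one voxel.

    corner_feats maps a corner offset (dx, dy, dz) in {0, 1}^3 to its feature
    vector; (fx, fy, fz) in [0, 1]^3 is the sample position inside the voxel.
    """
    dim = len(corner_feats[(0, 0, 0)])
    out = [0.0] * dim
    for (dx, dy, dz), feat in corner_feats.items():
        w = ((fx if dx else 1 - fx) *
             (fy if dy else 1 - fy) *
             (fz if dz else 1 - fz))
        for i in range(dim):
            out[i] += w * feat[i]
    return out

def toy_decoder(feature):
    """Hypothetical stand-in for the NeRF MLP: feature -> (density, rgb)."""
    sigma = max(feature[0], 0.0)                          # density must be >= 0
    rgb = [min(max(c, 0.0), 1.0) for c in feature[1:4]]   # clamp color to [0, 1]
    return sigma, rgb

def composite(samples, deltas):
    """Accumulate partial pixel values front to back along the ray.

    C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    where T_i is the product of (1 - alpha_j) over earlier samples j < i.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for (sigma, rgb), delta in zip(samples, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for k in range(3):
            color[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
    return color

# Made-up 4-channel corner features [density, r, g, b]; for brevity all
# three samples (S1, S2, S3) fall in this one voxel.
voxel = {(dx, dy, dz): [2.0 * (dx + dy + dz), 0.2 + 0.2 * dx, 0.5, 0.8 - 0.2 * dz]
         for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)}
offsets = [(0.1, 0.2, 0.3), (0.5, 0.5, 0.5), (0.9, 0.4, 0.6)]
samples = [toy_decoder(trilinear(voxel, *o)) for o in offsets]
pixel = composite(samples, deltas=[0.1, 0.1, 0.1])
print("pixel rgb:", [round(c, 3) for c in pixel])
```

The compositing weights $T_i(1 - e^{-\sigma_i \delta_i})$ sum to at most 1, so an opaque early sample hides everything behind it. This is also why the gather-interpolate-MLP steps dominate the cost: they run once per sample, while compositing is a cheap running sum.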