ReFiNe: Recursive Field Networks for Cross-modal Multi-scene Representation

Sergey Zakharov, Katherine Liu, Adrien Gaidon, Rares Ambrus

TL;DR

ReFiNe introduces a Recursive Field Network that encodes multiple 3D assets as continuous implicit fields within a single lightweight network by recursively expanding a per-object latent through an octree with pruning. The method unifies global and local conditioning via multi-scale feature fusion and decodes into flexible outputs (SDF, SDF+RGB, NeRF) suitable for ray tracing and differentiable rendering. Across Thingi32, ShapeNet150, SRN Cars, GSO, and RTMV, ReFiNe achieves high fidelity with substantially reduced memory usage, enabling scalable multi-object representations and cross-modal rendering with a single network per dataset. The resulting models are compact yet retain high-frequency geometry and texture details, which makes them practical for compression and rendering applications, and they exhibit a coherent latent space with smooth latent interpolation.
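
The recursive expansion with pruning summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the latent size, the random linear maps standing in for the subdivision and occupancy networks (called $\phi$ and $\omega$ in the Figure 1 caption), the 0.5 threshold, and all function names are assumptions made for illustration only.

```python
import math
import random

LATENT_DIM = 8  # hypothetical latent size, chosen only for this sketch


def matvec(weights, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def subdivide(parent, phi):
    """Stand-in for phi: one parent latent -> eight child latents."""
    out = [math.tanh(v) for v in matvec(phi, parent)]
    return [out[i * LATENT_DIM:(i + 1) * LATENT_DIM] for i in range(8)]


def occupied(latent, omega):
    """Stand-in for omega: sigmoid occupancy score, kept if above 0.5."""
    score = sum(w * v for w, v in zip(omega, latent))
    return 1.0 / (1.0 + math.exp(-score)) > 0.5


def expand(root, phi, omega, max_lod):
    """Return one dict per LoD mapping integer voxel coords to latents."""
    levels = [{(0, 0, 0): root}]
    for _ in range(max_lod):
        children = {}
        for (x, y, z), latent in levels[-1].items():
            for i, child in enumerate(subdivide(latent, phi)):
                if occupied(child, omega):  # prune unoccupied octants
                    key = (2 * x + (i & 1),
                           2 * y + ((i >> 1) & 1),
                           2 * z + ((i >> 2) & 1))
                    children[key] = child
        levels.append(children)
    return levels


# Random, untrained weights: the point is the control flow, not the output.
rng = random.Random(0)
phi = [[rng.uniform(-1, 1) for _ in range(LATENT_DIM)]
       for _ in range(8 * LATENT_DIM)]
omega = [rng.uniform(-1, 1) for _ in range(LATENT_DIM)]
root = [rng.uniform(-1, 1) for _ in range(LATENT_DIM)]

levels = expand(root, phi, omega, max_lod=3)
for lod, voxels in enumerate(levels):
    print(f"LoD {lod}: kept {len(voxels)} of at most {8 ** lod} voxels")
```

Only the root latent is stored per object in this sketch; every deeper level is regenerated from it by the shared weights, which is the property that keeps the per-object cost small.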

Abstract

The common trade-offs of state-of-the-art methods for multi-shape representation (a single model "packing" multiple objects) involve trading modeling accuracy against memory and storage. We show how to encode multiple shapes represented as continuous neural fields with a higher degree of precision than previously possible and with low memory usage. Key to our approach is a recursive hierarchical formulation that exploits object self-similarity, leading to a highly compressed and efficient shape latent space. Thanks to the recursive formulation, our method supports spatial and global-to-local latent feature fusion without needing to initialize and maintain auxiliary data structures, while still allowing for continuous field queries to enable applications such as raytracing. In experiments on a set of diverse datasets, we provide compelling qualitative results and demonstrate state-of-the-art multi-scene reconstruction and compression results with a single network per dataset.
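
To make the memory claim more concrete, the Google Scanned Objects numbers quoted in the Figure 3 caption can be turned into rough ratios. The sketch below is only back-of-the-envelope arithmetic: it assumes decimal units (1 GB = 1000 MB, 1 MB = 1000 KB) and treats the "about 1.5 GB" mesh size as exact, so the results are approximate.

```python
# Figures taken from the Figure 3 caption (Google Scanned Objects).
network_mb = 45.6      # single shared network
latents_mb = 1.05      # list of per-object latent vectors
num_objects = 1030
meshes_mb = 1.5 * 1000  # "about 1.5 GB"; decimal units are an assumption

total_mb = network_mb + latents_mb
compression_ratio = meshes_mb / total_mb
latent_kb_per_object = latents_mb * 1000 / num_objects
amortized_kb_per_object = total_mb * 1000 / num_objects

print(f"total model size:         {total_mb:.2f} MB")
print(f"approx. compression:      {compression_ratio:.1f}x")
print(f"latent per object:        {latent_kb_per_object:.2f} KB")
print(f"amortized size per object: {amortized_kb_per_object:.2f} KB")
```

Under these assumptions the network plus latents come to roughly 1/32 of the untextured mesh storage, and each additional object costs on the order of 1 KB of latent storage because the network weights are shared.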

Paper Structure (38 sections, 5 equations, 14 figures, 7 tables)

Figures (14)

  • Figure 1: ReFiNe architecture. ReFiNe uses an implicit recursive hierarchical representation and a combination of spatial and global-to-local feature fusion to accurately reconstruct 3D assets. Given a single input feature corresponding to LoD 0, ReFiNe recursively expands an octree to the desired LoD using the latent subdivision network $\phi$. Unoccupied voxels at each LoD are pruned based on the output of $\omega$. To obtain a feature value at a specific spatial coordinate, we perform tri-linear interpolation within each individual LoD, then aggregate the features via multi-scale feature fusion. Finally, we use $\xi$ and $\psi$ to decode color and geometry respectively for the desired coordinate. Given the ability to query coordinates within the scene bounds, various methods including differentiable rendering can be applied for reconstruction. Importantly, ReFiNe optimizes a single LoD 0 feature per 3D asset in the training dataset, enabling multiple assets to be reconstructed from a single trained ReFiNe network. Voxel grids at LoDs not drawn to scale.
  • Figure 2: SDF reconstruction comparisons on selected Thingi32 and ShapeNet150 objects. DeepSDF and Curriculum DeepSDF capture the high-level geometry of visualized objects but fail to accurately model high-frequency details such as the teeth in the top row and the chair legs in the bottom row. In this example, we also observe that ReFiNe is capable of representing geometry accurately while preserving overall shape smoothness better than ROAD (e.g., as seen in the teeth in the top row). Unlike ROAD, which also models objects recursively but discretely at each LoD, ReFiNe models objects as continuous fields using multi-scale feature interpolation. In this visualization, ROAD and ReFiNe use nine and six LoDs, respectively. Quantitative results reported in Table \ref{tab:sdf}.
  • Figure 3: Decoded Datasets. ReFiNe can encode Google Scanned Objects, a complex dataset of 1030 colored 3D objects within a single neural network of size 45.6 MB and a list of latent vectors of 1.05 MB (whereas the original meshes without texture require about 1.5 GB of storage). On the right, we show complex decoded reconstructions from our network trained on the RTMV dataset of 40 diverse scenes.
  • Figure 4: Qualitative RTMV ablation results. We observe that increasing the latent size increases the capacity of ReFiNe for high fidelity reconstruction, and that more complex scenarios such as the cluttered scene in the bottom row may benefit more from larger latent spaces. Both objects are rendered from the same network. Quantitative results reported in Table \ref{tab:rtmv}.
  • Figure 5: SRN Cars benchmark. Our method outperforms other methods at reconstructing high-frequency details.
  • ...and 9 more figures
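
The query step described in the Figure 1 caption (tri-linear interpolation within each LoD, followed by multi-scale feature fusion) can be sketched as follows. This is a simplified illustration, not the paper's code: it assumes dense grids with features stored at voxel corners, ignores pruning, and fuses levels by summation, which is only one possible fusion choice. The decoders $\xi$ and $\psi$ are omitted, and the helper names are hypothetical.

```python
def trilinear(grid, res, p):
    """Interpolate corner features of a dense (res+1)^3 grid at p in [0,1]^3."""
    scaled = [min(max(c, 0.0), 1.0) * res for c in p]
    base = [min(int(s), res - 1) for s in scaled]  # lower corner of the cell
    frac = [s - b for s, b in zip(scaled, base)]   # position inside the cell
    dim = len(grid[(0, 0, 0)])
    out = [0.0] * dim
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = (
                    (frac[0] if dx else 1.0 - frac[0])
                    * (frac[1] if dy else 1.0 - frac[1])
                    * (frac[2] if dz else 1.0 - frac[2])
                )
                corner = grid[(base[0] + dx, base[1] + dy, base[2] + dz)]
                for d in range(dim):
                    out[d] += weight * corner[d]
    return out


def fuse(grids, p):
    """Sum the per-LoD interpolated features (summation is an assumption)."""
    fused = None
    for lod, grid in enumerate(grids):
        feat = trilinear(grid, 2 ** lod, p)  # LoD l has 2^l cells per axis
        fused = feat if fused is None else [a + b for a, b in zip(fused, feat)]
    return fused


def linear_grid(res):
    """Toy grid whose single feature is x + 2y + 3z at each corner."""
    return {
        (i, j, k): [(i + 2 * j + 3 * k) / res]
        for i in range(res + 1)
        for j in range(res + 1)
        for k in range(res + 1)
    }


grids = [linear_grid(2 ** lod) for lod in range(3)]  # LoD 0, 1, 2
print(fuse(grids, (0.3, 0.6, 0.9)))
```

Because tri-linear interpolation reproduces linear functions, each of the three toy levels returns about 0.3 + 2(0.6) + 3(0.9) = 4.2 at the query point, so the fused value is about 12.6 up to floating-point rounding. In a trained model the fused feature would instead be passed to the geometry and color decoders.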