Affine-Equivariant Kernel Space Encoding for NeRF Editing

Mikołaj Zieliński, Krzysztof Byrski, Tomasz Szczepanik, Dominik Belter, Przemysław Spurek

TL;DR

Affine-Equivariant Kernel Space Encoding (EKS) redefines NeRF latent spaces by using a field of anisotropic Gaussian kernels to enable localized, deformation-aware editing while preserving rendering fidelity. Features are interpolated through Mahalanobis-distance-based weights over nearby Gaussians, and a Ray-Traced Gaussian Proximity Search ensures affine-consistent neighborhood queries. A training-time hash-grid feature distillation transfers detail into the kernel field, yielding a grid-free representation suitable for editing via Gaussian tetrahedra bound to meshes. Empirical results on NeRF-Synthetic, Mip-NeRF 360, and physics-based benchmarks show competitive reconstruction quality and superior editing robustness, including physics-driven scene manipulation, relative to prior editable NeRF methods.

Abstract

Neural scene representations achieve high-fidelity rendering by encoding 3D scenes as continuous functions, but their latent spaces are typically implicit and globally entangled, making localized editing and physically grounded manipulation difficult. While several works introduce explicit control structures or point-based latent representations to improve editability, these approaches often suffer from limited locality, sensitivity to deformations, or visual artifacts. In this paper, we introduce Affine-Equivariant Kernel Space Encoding (EKS), a spatial encoding for neural radiance fields that provides localized, deformation-aware feature representations. Instead of querying latent features directly at discrete points or grid vertices, our encoding aggregates features through a field of anisotropic Gaussian kernels, each defining a localized region of influence. This kernel-based formulation enables stable feature interpolation under spatial transformations while preserving continuity and high reconstruction quality. To preserve detail without sacrificing editability, we further propose a training-time feature distillation mechanism that transfers information from multi-resolution hash grid encodings into the kernel field, yielding a compact and fully grid-free representation at inference. This enables intuitive, localized scene editing directly via Gaussian kernels without retraining, while maintaining high-quality rendering. The code is available at https://github.com/MikolajZielinski/eks.
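
As a rough illustration of what aggregating features through anisotropic Gaussian kernels can look like, the sketch below blends the features of the k kernels nearest to a query point, with nearness measured by Mahalanobis distance. It is a 2D toy in plain Python, not the authors' implementation: the dictionary layout, the function names, the weight exp(-d²/2), and the normalisation are all assumptions made for the example.

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of point x from a Gaussian (mean, cov)."""
    p = inv2(cov)
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return dx * (p[0][0] * dx + p[0][1] * dy) + dy * (p[1][0] * dx + p[1][1] * dy)

def encode(x, gaussians, k=2):
    """Blend the features of the k kernels nearest to x.

    Each kernel is a dict with 'mean', 'cov' and 'feature' (hypothetical layout).
    The weight exp(-d^2 / 2) and the normalisation are illustrative assumptions.
    """
    scored = sorted(
        (mahalanobis_sq(x, g["mean"], g["cov"]), i) for i, g in enumerate(gaussians)
    )
    nearest = scored[:k]
    weights = [math.exp(-0.5 * d2) for d2, _ in nearest]
    total = sum(weights)
    out = [0.0] * len(gaussians[0]["feature"])
    if total == 0.0:
        # Every selected kernel is too far away to contribute.
        return out
    for w, (_, i) in zip(weights, nearest):
        for j, f in enumerate(gaussians[i]["feature"]):
            out[j] += (w / total) * f
    return out

identity = ((1.0, 0.0), (0.0, 1.0))
kernels = [
    {"mean": (0.0, 0.0), "cov": identity, "feature": [1.0, 0.0]},
    {"mean": (2.0, 0.0), "cov": identity, "feature": [0.0, 1.0]},
]
halfway = encode((1.0, 0.0), kernels)  # equidistant from both kernels
at_first = encode((0.0, 0.0), kernels)  # sits on the first kernel's mean
```

In this toy, a query halfway between the two kernels receives an equal blend of their features, while a query at the first kernel's mean is dominated by that kernel's feature.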

Paper Structure

This paper contains 26 sections, 12 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: EKS overview. EKS represents positional features using spatially localized anisotropic Gaussian kernels, enabling stable and fine-grained interactive editing while maintaining the high-fidelity rendering of Neural Radiance Fields.
  • Figure 2: Physical simulations. From left to right: (1) Rigid body simulation of falling leaves. (2) Soft body simulation of the Lego dozer being squished. (3) Cloth simulation of fabric falling onto a cup. The middle columns show the deformation-driving meshes.
  • Figure 3: Evolution of two physical simulations. From left to right: (1) A rubber duck falling onto a pillow and deforming it. (2) A pirate flag waving under the influence of wind. Both simulations are performed on our own assets.
  • Figure 4: Model overview. Top: During training, a subset of Gaussians is selected using Ray-Traced Gaussian Proximity Search (RT-GPS), which also handles pruning. The nearest Gaussians to the sampling position $\mathbf{x}$ are passed to the Kernel Space Encoding, which interpolates their features to produce the final positional embedding $\mathbf{v}(\mathbf{x}; \mathcal{G})$. The embedding is then processed by the neural network $\mathcal{F}$ to predict colour $\mathbf{c}$ and opacity $\sigma$, which are used for volumetric rendering. Bottom: At inference time, the learned Gaussians serve as input parameters and can undergo manual or physics-driven edits. The edited Gaussians are passed through the same rendering pipeline to generate the final image, with the view-direction input to $\mathcal{F}$ adjusted by the inverse rotation of the modified Gaussians. Since the kernel space encoding is fixed after training, the auxiliary network $\mathcal{H}_{\text{enc}}$ is omitted during inference.
  • Figure 5: KNN comparisons. Comparison of neighbourhood changes under deformation using Euclidean-distance KNN (top) versus our proposed Mahalanobis-distance KNN (bottom). Moving points in traditional encodings changes local neighbourhoods inconsistently, causing unstable feature interpolation. Our method preserves relative feature structure under spatial transformations and yields visibly improved results without holes or distortions.
  • ...and 7 more figures
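
The contrast described in the Figure 5 caption can be reproduced on a small 2D toy. The sketch below applies one affine map (a stretch with shear, plus a translation) to two Gaussians and to the query point, then compares neighbour rankings before and after. It assumes that each covariance transforms as Σ' = TΣTᵀ and that the query moves with the deformation. All names and values are hypothetical, and a plain sort stands in for the paper's Ray-Traced Gaussian Proximity Search.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def affine(T, t, p):
    """Apply p -> T p + t to a 2D point."""
    return (T[0][0] * p[0] + T[0][1] * p[1] + t[0],
            T[1][0] * p[0] + T[1][1] * p[1] + t[1])

def euclidean_sq(x, mean, cov):
    """Squared Euclidean distance to the kernel centre (covariance ignored)."""
    return (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of x from the Gaussian (mean, cov)."""
    p = inv2(cov)
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return dx * (p[0][0] * dx + p[0][1] * dy) + dy * (p[1][0] * dx + p[1][1] * dy)

def ranking(x, gaussians, dist):
    """Indices of the Gaussians sorted from nearest to farthest."""
    return sorted(range(len(gaussians)), key=lambda i: dist(x, *gaussians[i]))

# Illustrative deformation: stretch and shear, then translate.
T = ((3.0, 1.0), (0.0, 1.0))
t = (0.5, -2.0)
T_transposed = tuple(zip(*T))

identity = ((1.0, 0.0), (0.0, 1.0))
gaussians = [((1.0, 0.0), identity), ((0.0, 1.5), identity)]
x = (0.0, 0.0)

# Means move with the map; covariances follow Sigma' = T Sigma T^T.
moved = [(affine(T, t, m), matmul(matmul(T, c), T_transposed)) for m, c in gaussians]
x_moved = affine(T, t, x)

print(ranking(x, gaussians, euclidean_sq), ranking(x_moved, moved, euclidean_sq))
# prints [0, 1] [1, 0]: the Euclidean neighbour order flips
print(ranking(x, gaussians, mahalanobis_sq), ranking(x_moved, moved, mahalanobis_sq))
# prints [0, 1] [0, 1]: the Mahalanobis neighbour order is preserved
```

In this example the squared Mahalanobis distances themselves (1 and 2.25) are unchanged by the map up to floating-point error, which is why the ranking, and any weights derived from those distances, stay stable under the deformation.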