GaINeR: Geometry-Aware Implicit Network Representation

Weronika Jakubowska, Mikołaj Zieliński, Rafał Tobiasz, Krzysztof Byrski, Maciej Zięba, Dominik Belter, Przemysław Spurek

TL;DR

GaINeR addresses the lack of explicit geometry in 2D implicit neural representations by introducing trainable Gaussian embeddings that condition an INR decoder via radius-limited KNN aggregation, yielding continuous, geometry-aware reconstructions and intuitive local edits. The framework also supports 2D-to-3D lifting: Gaussian means are lifted into 3D and the decoder additionally predicts density for NeRF-like rendering, enabling depth-aware novel viewpoints and integration with physical simulations. On the Kodak and DIV2K datasets, GaINeR achieves state-of-the-art reconstruction quality and demonstrates robust editing and physics-enabled dynamics while maintaining a coherent, interpretable geometric prior. The result is a unified, editable, geometry-aware representation that bridges 2D perception and 3D reasoning, with implications for interactive graphics and simulation-based learning.

Abstract

Implicit Neural Representations (INRs) have become an essential tool for modeling continuous 2D images, enabling high-fidelity reconstruction, super-resolution, and compression. Popular architectures such as SIREN, WIRE, and FINER demonstrate the potential of INR for capturing fine-grained image details. However, traditional INRs often lack explicit geometric structure and have limited capabilities for local editing or integration with physical simulation, restricting their applicability in dynamic or interactive settings. To address these limitations, we propose GaINeR: Geometry-Aware Implicit Network Representation, a novel framework for 2D images that combines trainable Gaussian distributions with a neural network-based INR. For a given image coordinate, the model retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value via a neural network. This design enables continuous image representation, interpretable geometric structure, and flexible local editing, providing a foundation for physically aware and interactive image manipulation. The official implementation of our method is publicly available at https://github.com/WJakubowska/GaINeR.
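The query path described in the abstract (retrieve the K nearest Gaussians, aggregate distance-weighted embeddings, decode to RGB) can be illustrated with a minimal NumPy sketch. This is not the official implementation: the function names `knn_aggregate` and `decode_rgb`, the inverse-distance weighting, the plain embedding table (the paper derives embeddings from a multi-resolution hashgrid), and the tiny two-layer decoder are all assumptions made for illustration.

```python
import numpy as np

def knn_aggregate(x, means, embeddings, k=4, radius=0.1, eps=1e-8):
    """Blend the embeddings of the k nearest Gaussian means within `radius` of x.

    Hypothetical helper: inverse-distance weights are an assumed choice of
    "distance-weighted" aggregation, not necessarily the paper's.
    """
    # Brute-force Euclidean distance from the query to every Gaussian mean.
    dists = np.linalg.norm(means - x, axis=1)
    # Indices of the k closest means (unordered among themselves).
    k = min(k, len(means))
    nearest = np.argpartition(dists, k - 1)[:k]
    # Radius limit: discard neighbours that are too far away.
    nearest = nearest[dists[nearest] <= radius]
    if nearest.size == 0:
        # Assumed fallback: a zero embedding when no Gaussian is in range.
        return np.zeros(embeddings.shape[1])
    # Normalised inverse-distance weights.
    w = 1.0 / (dists[nearest] + eps)
    w /= w.sum()
    return w @ embeddings[nearest]

def decode_rgb(e, W1, b1, W2, b2):
    """Toy stand-in for the MLP decoder: one ReLU layer, sigmoid RGB output."""
    h = np.maximum(e @ W1 + b1, 0.0)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

# Usage: 100 random Gaussians in the unit square with 8-dimensional embeddings.
rng = np.random.default_rng(0)
means = rng.random((100, 2))
embeddings = rng.normal(size=(100, 8))
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)

e = knn_aggregate(np.array([0.5, 0.5]), means, embeddings, k=4, radius=0.2)
rgb = decode_rgb(e, W1, b1, W2, b2)  # three values in (0, 1)
```

Because the color depends on the Gaussian means only through the neighbour search and the weights, moving a subset of means changes the output locally, which is the intuition behind the editing behaviour described above. A practical implementation would replace the brute-force distance computation with a spatial index.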

Paper Structure

This paper contains 28 sections, 13 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: GaINeR produces realistic image edits by preserving structural consistency under spatial transformations. Red arrows indicate the direction of the applied changes.
  • Figure 2: Overview of the GaINeR framework. (a) Training: For RGBA inputs, Gaussian components outside the alpha mask are removed from training. Each input coordinate $\mathbf{x}$ is encoded via Gaussian features $\mathbf{e}_i$ derived from the multi-resolution hashgrid $\mathcal{H}(\mu_i)$. The aggregated embedding $\mathbf{e}_{\mathrm{KNN}}(\mathbf{x}, \mathcal{G})$ is obtained through radius-limited KNN interpolation and decoded by the MLP $f_\theta$ to reconstruct pixel color $\mathbf{c}$. (b) Rendering and Editing: After training, the optimized Gaussian set $\mathcal{G}$ is used to reconstruct the sampling mask (for RGBA cases) and to enable spatial transformations. Edited Gaussian means $\mu_i' \in \mathbb{R}^2$ are re-encoded to produce updated embeddings $\mathbf{e}_{\mathrm{KNN}}'(\mathbf{x}, \mathcal{G}')$, which the decoder maps to modified image values, enabling geometry-consistent editing.
  • Figure 3: PSNR comparison on a Singapore image from the DIV2K dataset. All models were trained for 30k iterations, yet our method achieves the highest PSNR within the first 1–2k iterations, often already surpassing all other approaches at that point.
  • Figure 4: Overview of 3D reconstruction and rendering with GaINeR. Input images are first segmented using Segment Anything (SAM) [kirillov2023segment] and depth is estimated with Depth-Pro [bochkovskiy2025depthpro]. The model is trained in 2D, but during inference the Gaussian means $\mu_i$ are lifted into 3D using the estimated depth values. The decoder is extended to predict both color $\mathbf{c}$ and density $\sigma$, enabling volumetric rendering analogous to NeRF [mildenhall2021nerf]. This allows the model to synthesize consistent 3D views from novel camera perspectives, as illustrated at the bottom of the figure.
  • Figure 5: Visual comparison of pixel-wise absolute errors between ground truth and reconstructed images produced by our method and competing approaches. Error magnitudes are contrast-enhanced using gamma correction ($\gamma = 0.2$) for better visibility.
  • ...and 10 more figures
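
The two quantities that the captions of Figures 3 and 5 refer to, PSNR and a gamma-corrected absolute-error map with $\gamma = 0.2$, can be computed as in the sketch below. It assumes images stored as floating-point arrays in [0, 1]; the function names `psnr` and `gamma_error_map` are hypothetical, and the authors' evaluation code may differ in details such as channel handling.

```python
import numpy as np

def psnr(gt, recon, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    gt = np.asarray(gt, dtype=np.float64)
    recon = np.asarray(recon, dtype=np.float64)
    mse = np.mean((gt - recon) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def gamma_error_map(gt, recon, gamma=0.2):
    """Pixel-wise absolute error, contrast-enhanced by raising it to `gamma`.

    With gamma < 1, small errors are amplified so they become visible.
    Assumption: the error is kept per channel rather than averaged.
    """
    gt = np.asarray(gt, dtype=np.float64)
    recon = np.asarray(recon, dtype=np.float64)
    err = np.clip(np.abs(gt - recon), 0.0, 1.0)
    return err ** gamma

# Usage: a flat gray image and a slightly perturbed copy.
rng = np.random.default_rng(0)
gt = np.full((4, 4, 3), 0.5)
recon = np.clip(gt + rng.normal(scale=0.01, size=gt.shape), 0.0, 1.0)
quality_db = psnr(gt, recon)
error_vis = gamma_error_map(gt, recon)  # same shape as the inputs, values in [0, 1]
```

An absolute error of 0.01 maps to roughly 0.4 after the gamma step, which is why such maps make small reconstruction differences between methods easier to see.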