End-to-End Implicit Neural Representations for Classification

Alexander Gielisse, Jan van Gemert

TL;DR

This work tackles image classification with implicit neural representations (INRs) by introducing an end-to-end meta-learning framework that jointly learns a SIREN initialization $\theta$ and an inner-loop learning-rate schedule $\alpha$. Starting from $\theta$, a few gradient steps yield an image-specific INR with parameters $\phi$, which a Transformer operating on weight-space tokens then classifies, without enforcing symmetry-equivariant designs. The approach reports state-of-the-art results among INR classifiers on several benchmarks (e.g., CIFAR-10, MNIST, Fashion-MNIST) and scales to high-resolution data such as Imagenette and ImageNet-1K, while allowing computational shortcuts like pixel subsampling. Overall, the paper shifts INR research toward classification-aware representations, showing that learning the INR weight structure end-to-end substantially improves INR-based classification over prior INR approaches. The abstract notes that INR-based classification still under-performs pixel-based methods such as CNNs.
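
The inner loop summarized above can be written compactly. This is a sketch under assumptions: the symbol $\mathcal{L}_{\text{rec}}$ for the reconstruction loss, the step index $j$, and reading $\alpha_j$ as the learning rate applied at step $j$ are notation introduced here for illustration and may differ from the paper's; $k$ is the number of inner-loop steps.

$$\phi_0 = \theta, \qquad \phi_{j+1} = \phi_j - \alpha_j \, \nabla_{\phi}\, \mathcal{L}_{\text{rec}}(\phi_j), \qquad j = 0, \dots, k-1.$$

The classifier receives $\phi_k$. Because each update is differentiable, the classification loss can propagate back through the steps to $\theta$ and $\alpha$, which is what makes the framework end-to-end.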

Abstract

Implicit neural representations (INRs) such as NeRF and SIREN encode a signal in neural network parameters and show excellent results for signal reconstruction. Using INRs for downstream tasks, such as classification, is however not straightforward. Inherent symmetries in the parameters pose challenges and current works primarily focus on designing architectures that are equivariant to these symmetries. However, INR-based classification still significantly under-performs compared to pixel-based methods like CNNs. This work presents an end-to-end strategy for initializing SIRENs together with a learned learning-rate scheme, to yield representations that improve classification accuracy. We show that a simple, straightforward, Transformer model applied to a meta-learned SIREN, without incorporating explicit symmetry equivariances, outperforms the current state-of-the-art. On the CIFAR-10 SIREN classification task, we improve the state-of-the-art without augmentations from 38.8% to 59.6%, and from 63.4% to 64.7% with augmentations. We demonstrate scalability on the high-resolution Imagenette dataset achieving reasonable reconstruction quality with a classification accuracy of 60.8% and are the first to do INR classification on the full ImageNet-1K dataset where we achieve a SIREN classification performance of 23.6%. To the best of our knowledge, no other SIREN classification approach has managed to set a classification baseline for any high-resolution image dataset. Our code is available at https://github.com/SanderGielisse/MWT

Paper Structure

This paper contains 16 sections, 8 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Left: the existing two-step INR classification approach, which builds INR datasets by fitting images with many update steps and without classifier feedback on the weight structure. The classifier is trained separately, so the classification loss cannot influence the INR. Right: the proposed end-to-end meta-learned INR approach for classification. This method fits high-resolution images in a few update steps from a learned initialization with feedback from the classifier, so the classification loss updates both the classifier and the INR weight structure, improving downstream performance while ensuring quick convergence.
  • Figure 2: Our method illustrated for a high-resolution Imagenette image. A meta-learned SIREN initialization is updated for a small number of gradient steps, in this case $k=4$. The resulting weights are then passed to a classifier. Meta-learning allows us to back-propagate through the update steps, optimizing the SIREN end-to-end for both reconstruction and classification.
  • Figure 3: ModelNet40 results on unsigned distance functions. Trained for 150 epochs using a dimensionality of $256$ and $k=4$ inner-loop steps, with augmentations enabled. Timings include the inner-loop fitting of the INR. inr2vec scores an accuracy of $87.0\%$ on this dataset, while non-INR approaches such as PointNet and PointNet++ outperform it with accuracies of $88.8\%$ and $89.7\%$, respectively.
  • Figure 4: A visualization of the trade-off made in the ablation of the MWT model. Increasing the influence of the classifier on the meta-learning of the INR ($w_{\text{task}}$) decreases reconstruction quality but increases classification performance up to about $w_{\text{task}} = 0.01$, after which both decrease.
  • Figure 5: Our proposed meta-learning framework can quickly adapt to a new signal. This figure demonstrates how a SIREN converges on high-resolution Imagenette images in just $k=4$ update steps, while also providing INR weights that are structured for classification. The images shown here come from $\text{MWT-L}_{s=0.1}$ trained on Imagenette, and have an average PSNR of $22.31$ dB after the last step $k=4$.
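
The few-step fitting described in the Figure 2 caption can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it uses a hypothetical one-hidden-layer SIREN on scalar coordinates, hand-written gradients, and fixed placeholder values for the initialization `theta` and the per-step learning rates `alphas`, which in the paper are meta-learned. The function names, the toy signal, and the choice of one scalar learning rate per step are all assumptions made for illustration. The outer loop and the Transformer classifier are omitted.

```python
import math

OMEGA0 = 30.0  # sine frequency scale; 30 is a common SIREN default (assumed here)


def forward(phi, xs):
    """Tiny one-hidden-layer SIREN on scalar coordinates.

    phi = (w, b, v, c): y(x) = sum_i v[i] * sin(OMEGA0 * (w[i] * x + b[i])) + c
    """
    w, b, v, c = phi
    return [
        sum(vi * math.sin(OMEGA0 * (wi * x + bi)) for wi, bi, vi in zip(w, b, v)) + c
        for x in xs
    ]


def recon_loss(phi, xs, ts):
    """Mean squared reconstruction error over the sampled coordinates."""
    ys = forward(phi, xs)
    return sum((y - t) ** 2 for y, t in zip(ys, ts)) / len(xs)


def recon_grad(phi, xs, ts):
    """Analytic gradient of recon_loss with respect to (w, b, v, c)."""
    w, b, v, c = phi
    gw, gb, gv, gc = [0.0] * len(w), [0.0] * len(b), [0.0] * len(v), 0.0
    for x, y, t in zip(xs, forward(phi, xs), ts):
        dy = 2.0 * (y - t) / len(xs)  # d loss / d y at this coordinate
        gc += dy
        for i in range(len(w)):
            pre = OMEGA0 * (w[i] * x + b[i])
            gv[i] += dy * math.sin(pre)
            dpre = dy * v[i] * math.cos(pre) * OMEGA0  # chain rule through sin
            gw[i] += dpre * x
            gb[i] += dpre
    return gw, gb, gv, gc


def inner_loop(theta, alphas, xs, ts):
    """Adapt theta to one signal with len(alphas) gradient steps.

    theta and alphas would be meta-learned in the paper's setting; here they are
    plain inputs, and alphas holds one scalar step size per step (an assumption).
    """
    w, b, v, c = list(theta[0]), list(theta[1]), list(theta[2]), theta[3]
    for alpha in alphas:
        gw, gb, gv, gc = recon_grad((w, b, v, c), xs, ts)
        w = [p - alpha * g for p, g in zip(w, gw)]
        b = [p - alpha * g for p, g in zip(b, gb)]
        v = [p - alpha * g for p, g in zip(v, gv)]
        c = c - alpha * gc
    return w, b, v, c  # phi: the weights a classifier would consume


# Hypothetical toy signal and placeholder values for theta and alphas.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ts = [0.2, 0.4, 0.1, -0.3, 0.0]
theta = ([0.5, -0.3], [0.1, 0.2], [0.05, -0.02], 0.0)
phi = inner_loop(theta, [1e-3] * 4, xs, ts)  # k = 4 steps
print(recon_loss(theta, xs, ts), recon_loss(phi, xs, ts))
```

In a full implementation, an automatic-differentiation framework would replace `recon_grad`, so that gradients of the classification and reconstruction losses can flow back through all `k` updates into `theta` and `alphas`.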