From MLP to NeoMLP: Leveraging Self-Attention for Neural Fields

Miltiadis Kofinas, Samuele Papa, Efstratios Gavves

TL;DR

Neural fields enable continuous signal encoding but struggle with downstream tasks due to parameter-space symmetries and limited conditioning. The authors propose NeoMLP, a graph-based reinterpretation of MLPs that performs message passing on a complete graph of input, hidden, and output nodes, with weights shared via self-attention. The hidden and output node embeddings act as per-signal latent codes ($\nu$-reps), and datasets of these codes are termed $\nu$-sets. The authors report high-fidelity fits of high-resolution signals, including multi-modal audio-visual data, and downstream results that outperform recent state-of-the-art methods; the code is open source. The work offers a scalable, unified conditioning mechanism for neural fields and opens avenues for multi-modal and dataset-level representations.

Abstract

Neural fields (NeFs) have recently emerged as a state-of-the-art method for encoding spatio-temporal signals of various modalities. Despite the success of NeFs in reconstructing individual signals, their use as representations in downstream tasks, such as classification or segmentation, is hindered by the complexity of the parameter space and its underlying symmetries, in addition to the lack of powerful and scalable conditioning mechanisms. In this work, we draw inspiration from the principles of connectionism to design a new architecture based on MLPs, which we term NeoMLP. We start from an MLP, viewed as a graph, and transform it from a multi-partite graph to a complete graph of input, hidden, and output nodes, equipped with high-dimensional features. We perform message passing on this graph and employ weight-sharing via self-attention among all the nodes. NeoMLP has a built-in mechanism for conditioning through the hidden and output nodes, which function as a set of latent codes, and as such, NeoMLP can be used straightforwardly as a conditional neural field. We demonstrate the effectiveness of our method by fitting high-resolution signals, including multi-modal audio-visual data. Furthermore, we fit datasets of neural representations, by learning instance-specific sets of latent codes using a single backbone architecture, and then use them for downstream tasks, outperforming recent state-of-the-art methods. The source code is open-sourced at https://github.com/mkofinas/neomlp.
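
The message-passing step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single attention head, random weights, and arbitrary node counts, and the names `attention_message_passing`, `w_q`, `w_k`, and `w_v` are hypothetical. It only shows that one set of projection weights is shared by all nodes and that the attention matrix is dense, which corresponds to a complete graph.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_message_passing(nodes, w_q, w_k, w_v):
    """One synchronous round of message passing on a complete graph.

    nodes: (num_nodes, dim) features of input, hidden, and output nodes.
    The same projections are applied to every node (weight sharing).
    """
    q, k, v = nodes @ w_q, nodes @ w_k, nodes @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)      # (num_nodes, num_nodes), dense
    return nodes + attn @ v, attn        # residual update

# Assumed sizes, chosen only for illustration.
num_inputs, num_hidden, num_outputs, dim = 2, 4, 1, 8
num_nodes = num_inputs + num_hidden + num_outputs

rng = np.random.default_rng(0)
nodes = rng.normal(size=(num_nodes, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

updated, attn = attention_message_passing(nodes, w_q, w_k, w_v)
output_features = updated[-num_outputs:]  # features read from the output nodes
```

Every row of `attn` is a distribution over all nodes, so each node receives messages from every other node in a single step, in contrast to the layer-by-layer propagation of a standard MLP.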

Paper Structure

This paper contains 31 sections, 5 equations, 7 figures, 5 tables, 2 algorithms.

Figures (7)

  • Figure 1: The connectivity graphs of MLP and NeoMLP. NeoMLP performs message passing on the MLP graph. Going from MLP to NeoMLP, we use a fully connected graph and high-dimensional node features. In NeoMLP, the traditional notion of layers of neurons, as well as the asynchronous layer-wise propagation, cease to exist. Instead, we use synchronous message passing with weight-sharing via self-attention among all the nodes. NeoMLP has three types of nodes: input, hidden, and output nodes. The input is fed to NeoMLP through the input nodes, while the output nodes capture the output of the network.
  • Figure 2: The architecture of NeoMLP. We pass each input dimension through an RFF layer followed by a linear layer, and then add individual input embeddings to each input. The transformed inputs, alongside the embeddings for the hidden and output nodes, comprise the inputs to NeoMLP. NeoMLP has $L$ layers of residual self-attention and non-linear transformations. We capture the output that corresponds to the output nodes and pass it through a linear layer to get the final output of the network.
  • Figure 3: The hidden and output embeddings constitute a set of latent codes for each signal, and can be used as neural representations for downstream tasks. We term these neural representations $\nu$-reps, and datasets of neural representations $\nu$-sets.
  • Figure 4: Example frames from fitting the "bikes" video clip. The first row shows the ground truth, while the second and third rows show the reconstructions obtained using NeoMLP and Siren, respectively. We observe that NeoMLP learns to reconstruct the video with much greater fidelity.
  • Figure 5: Predictions for the "Bach" audio clip. The first row shows the ground-truth signal, while the second and third rows show the reconstructions from NeoMLP and Siren, respectively.
  • ...and 2 more figures
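
The input pipeline in the Figure 2 caption and the latent codes in the Figure 3 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: the random Fourier feature (RFF) frequencies and the linear layer are assumed to be shared across input dimensions, there is assumed to be one output node per output channel, and all names (`rff`, `build_nodes`, `w_in`, `input_emb`, `hidden_emb`, `output_emb`) and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 2 input coordinates (x, y), 4 hidden nodes, 3 output nodes.
num_inputs, num_hidden, num_outputs = 2, 4, 3
num_freqs, dim = 16, 8

# Hypothetical parameters; in training these would be learned.
freqs = rng.normal(size=(num_freqs,))                    # RFF frequencies
w_in = rng.normal(size=(2 * num_freqs, dim)) / np.sqrt(2 * num_freqs)
input_emb = rng.normal(size=(num_inputs, dim))           # one per input dimension
hidden_emb = rng.normal(size=(num_hidden, dim))          # per-signal latent codes
output_emb = rng.normal(size=(num_outputs, dim))         # per-signal latent codes

def rff(x, freqs):
    """Random Fourier features of each scalar input dimension.

    x: (num_inputs,) -> (num_inputs, 2 * num_freqs)
    """
    angles = 2.0 * np.pi * x[:, None] * freqs[None, :]
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=-1)

def build_nodes(coords):
    """Assemble the node sequence: input tokens, then hidden and output embeddings."""
    input_tokens = rff(coords, freqs) @ w_in + input_emb  # RFF -> linear -> + embedding
    return np.concatenate([input_tokens, hidden_emb, output_emb], axis=0)

coords = np.array([0.25, -0.5])
nodes = build_nodes(coords)

# The latent codes for this signal (a nu-rep in the paper's terminology)
# are the hidden and output embeddings, independent of the query coordinate.
nu_rep = np.concatenate([hidden_emb, output_emb], axis=0)
```

In this sketch, only the first `num_inputs` rows of `nodes` change with the query coordinate; the remaining rows are the signal's latent codes, which is why they can be stored and reused as a representation for downstream tasks. The $L$ residual self-attention layers and the final linear readout from the output nodes would operate on `nodes` and are omitted here.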