Attention Beats Linear for Fast Implicit Neural Representation Generation

Shuyi Zhang, Ke Liu, Jingjun Gu, Xiaoxu Cai, Zhihua Wang, Jiajun Bu, Haishuai Wang

TL;DR

This work tackles the inefficiency and aliasing challenges of implicit neural representations (INRs) when modeling high-frequency or discontinuous signals. It introduces Attention-based Localized INR (ANR), which combines a Localized Attention Layer (LAL) and an instance-specific R-Token hyper-network with an instance-agnostic MLP, augmented by variational coordinates to mitigate aliasing. The approach yields faster convergence and higher-quality reconstructions across image and 3D datasets, achieving notable PSNR gains (e.g., CelebA PSNR from 37.95 dB to 47.25 dB) while using more compact data representations. The results demonstrate ANR's generality for reconstruction and view synthesis and its potential as an efficient primitive for downstream generative tasks.

Abstract

Implicit Neural Representation (INR) has gained increasing popularity as a data representation method, serving as a prerequisite for innovative generation models. Unlike gradient-based methods, which are less efficient at inference, using a hyper-network to generate the parameters of the Multi-Layer Perceptron (MLP) that executes the INR function has emerged as a promising and efficient alternative. However, as a global continuous function, an MLP struggles to model highly discontinuous signals, resulting in slow convergence during training and inaccurate reconstruction. Moreover, an MLP requires a large number of representation parameters, which makes the data representation inefficient. In this paper, we propose a novel Attention-based Localized INR (ANR) composed of a localized attention layer (LAL) and a global MLP that integrates coordinate features with data features and converts them to meaningful outputs. We then design an instance representation framework that uses a transformer-like hyper-network to represent each data instance as a compact representation vector. With an instance-specific representation vector and instance-agnostic ANR parameters, the target signals are well reconstructed as a continuous function. We further address aliasing artifacts with variational coordinates when producing super-resolution inference results. Extensive experiments across four datasets demonstrate the efficacy of our ANR method, e.g., improving the PSNR from 37.95 dB to 47.25 dB on the CelebA dataset. Code is released at https://github.com/Roninton/ANR.
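
The idea of combining coordinate features with instance-specific data features can be sketched with ordinary dot-product cross-attention. The NumPy code below is a minimal illustration, not the released ANR implementation: coordinates are turned into queries, a set of instance tokens supplies keys and values, and a shared MLP maps the result to output values. The Fourier feature encoding, the function names (`fourier_features`, `init_params`, `attention_inr`), the layer sizes, and the use of unrestricted softmax attention instead of the paper's localized variant are all assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fourier_features(coords, num_freqs):
    # coords: (N, d) in [-1, 1]; returns (N, 2 * d * num_freqs).
    # Assumed encoding; the excerpt does not specify one.
    freqs = 2.0 ** np.arange(num_freqs) * np.pi
    angles = coords[:, :, None] * freqs[None, None, :]
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(coords.shape[0], -1)

def init_params(rng, feat_dim, token_dim, hidden_dim, out_dim):
    # Random instance-agnostic weights with hypothetical shapes.
    def w(m, n):
        return rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
    return {
        "Wq": w(feat_dim, hidden_dim),
        "Wk": w(token_dim, hidden_dim),
        "Wv": w(token_dim, hidden_dim),
        "W1": w(hidden_dim, hidden_dim), "b1": np.zeros(hidden_dim),
        "W2": w(hidden_dim, out_dim), "b2": np.zeros(out_dim),
    }

def attention_inr(coords, tokens, params, num_freqs=4):
    # Queries come from coordinates; keys and values come from instance tokens.
    feats = fourier_features(coords, num_freqs)          # (N, F)
    q = feats @ params["Wq"]                             # (N, D)
    k = tokens @ params["Wk"]                            # (T, D)
    v = tokens @ params["Wv"]                            # (T, D)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N, T)
    h = weights @ v                                      # (N, D)
    # Shared (instance-agnostic) two-layer MLP head.
    h = np.maximum(h @ params["W1"] + params["b1"], 0.0)
    return h @ params["W2"] + params["b2"], weights
```

In this sketch only `tokens` changes from one data instance to another, while `params` is shared, which mirrors the split between an instance-specific representation and instance-agnostic parameters described above.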

Paper Structure

This paper contains 30 sections, 2 theorems, 11 equations, 18 figures, and 5 tables.

Key Result

Proposition

The application of multiplication and addition operations to a set of wave signals $\Psi(x, \Omega )$ results in a new set of wave signals $\Psi(x, \Omega^\prime)$.
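
A short worked example may help show why a statement of this kind is plausible. It assumes (the excerpt does not define it) that $\Psi(x, \Omega)$ denotes sinusoids $\sin(\omega x)$ and $\cos(\omega x)$ with frequencies $\omega \in \Omega$. For any $\omega_1, \omega_2 \in \Omega$, the standard product-to-sum identities give:

$$
\begin{aligned}
\cos(\omega_1 x)\cos(\omega_2 x) &= \tfrac{1}{2}\left[\cos((\omega_1-\omega_2)x) + \cos((\omega_1+\omega_2)x)\right],\\
\sin(\omega_1 x)\sin(\omega_2 x) &= \tfrac{1}{2}\left[\cos((\omega_1-\omega_2)x) - \cos((\omega_1+\omega_2)x)\right],\\
\sin(\omega_1 x)\cos(\omega_2 x) &= \tfrac{1}{2}\left[\sin((\omega_1+\omega_2)x) + \sin((\omega_1-\omega_2)x)\right].
\end{aligned}
$$

Under that assumption, every product of two waves is a sum of waves, and adding waves yields another combination of waves. The result is therefore built from waves whose frequency set $\Omega^\prime$ contains the sums and differences $\omega_1 \pm \omega_2$. This is an illustration only, not the paper's proof.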

Figures (18)

  • Figure 1: The framework overview and different INR functions. (i) (Left) The framework overview: an INR reconstructs the mapping between coordinates and values. (ii) (Mid) ANR as the INR function. An ANR's parameters are data-agnostic, and it consists of a localized attention layer and linear components. We transform the coordinate inputs into attention Queries and obtain attention Keys/Values from the previously generated R-Tokens. (iii) (Right) MLP as the INR function. Except for the modulated MLP that represents a data instance, the other MLPs are data-agnostic.
  • Figure 2: Reconstruction results of ANR and MLP-based INR on the CelebA ($128\times128$) and LSUN ($256\times256$) datasets. Both experiments use a training batch size of 24 and are trained over 100k epochs. The red-bordered images show details of the original picture.
  • Figure 3: Visualization of Dot-Product Attention, Localized Attention Layer, and ANR. The term 'm' is a hyper-parameter that serves to set a threshold for attention weights.
  • Figure 4: Visualization of the Variational Coordinates. Original coordinates are sampled at specific fixed positions. Our strategy is to sample coordinates from a distribution.
  • Figure 5: Reconstruction performance on the CelebA and LSUN datasets. The size of each point indicates the representation size of the data instances. The term "_mod(x)" denotes the parameter-modulation setting used by TransINR [chen2022transformers] and IPC [kim2023generalizable], which limits the maximum number of predicted columns of the MLP weights. The depth of the whole MLP used in the INR functions is annotated beside the data points in the form "D(x)".
  • ...and 13 more figures
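
The contrast drawn in the Figure 4 caption, fixed coordinate positions versus coordinates sampled from a distribution, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it assumes 2D image coordinates normalized to $[-1, 1]$ and a uniform distribution over each pixel's cell, and the function names `grid_coordinates` and `variational_coordinates` are hypothetical. The actual distribution used by ANR is not given in this excerpt.

```python
import numpy as np

def grid_coordinates(height, width):
    """Fixed pixel-center coordinates in [-1, 1], shape (height * width, 2)."""
    ys = (np.arange(height) + 0.5) / height * 2.0 - 1.0
    xs = (np.arange(width) + 0.5) / width * 2.0 - 1.0
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy, xx], axis=-1).reshape(-1, 2)

def variational_coordinates(height, width, rng):
    """Sample one coordinate per pixel, uniformly inside that pixel's cell.

    Assumption: a uniform distribution over the cell; other choices
    (e.g. a Gaussian around the center) would fit the caption equally well.
    """
    centers = grid_coordinates(height, width)
    # A cell spans 2/height by 2/width in normalized units, so half of that
    # is the largest offset that keeps a sample inside its own cell.
    half_cell = np.array([1.0 / height, 1.0 / width])
    noise = rng.uniform(-1.0, 1.0, size=centers.shape) * half_cell
    return centers + noise
```

A training loop built on this sketch would call `variational_coordinates(height, width, np.random.default_rng(seed))` at each step in place of the fixed grid, so the INR function is queried at slightly different positions every time.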

Theorems & Definitions (2)

  • Proposition
  • Proposition