ASMR: Activation-sharing Multi-resolution Coordinate Networks For Efficient Inference

Jason Chun Lok Li, Steven Tin Sui Luo, Le Xu, Ngai Wong

TL;DR

The paper tackles the high inference cost of implicit neural representations by proposing Activation-Sharing Multi-Resolution (ASMR), which couples multi-resolution coordinate decomposition, hierarchical modulation, and activation-sharing inference to decouple MAC from network depth. By sharing activations across grids and injecting per-level biases, ASMR attains near $O(1)$ MAC with respect to depth while preserving or improving reconstruction quality relative to vanilla SIREN. It demonstrates up to ~500× MAC reductions on high-resolution image fitting and extends effectively to natural images, audio, video, and 3D data, while enabling meta-learning and global latent structure encoding. While ASMR introduces a rasterized-data bias that can hinder smooth continuous signals like SDFs, it provides a powerful, purely implicit framework with broad practical impact for deployment under strict hardware constraints.

Abstract

A coordinate network, or implicit neural representation (INR), is a fast-emerging method for encoding natural signals (such as images and videos) with the benefits of a compact neural representation. While numerous methods have been proposed to increase the encoding capabilities of an INR, an often overlooked aspect is inference efficiency, usually measured in multiply-accumulate (MAC) count. This is particularly critical in use cases where inference throughput is greatly limited by hardware constraints. To this end, we propose the Activation-Sharing Multi-Resolution (ASMR) coordinate network, which combines multi-resolution coordinate decomposition with hierarchical modulations. Specifically, an ASMR model enables the sharing of activations across grids of the data. This largely decouples its inference cost from its depth, which is directly correlated with its reconstruction capability, and yields near-$O(1)$ inference complexity irrespective of the number of layers. Experiments show that ASMR can reduce the MAC of a vanilla SIREN model by up to 500× while achieving even higher reconstruction quality than its SIREN baseline.

Paper Structure (31 sections, 1 theorem, 5 equations, 12 figures, 12 tables)

Key Result

Proposition 1

ASMR decouples its inference cost (in terms of MAC) from its depth $L$, and consequently its corresponding reconstruction quality.
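
As a rough illustration of Proposition 1, the sketch below compares per-sample MAC counts under a simplified cost model. The model is an assumption, not the paper's exact accounting: it treats a 1-D signal, gives the network one layer per hierarchical level, evaluates each hidden-to-hidden matrix multiply once per grid cell of the previous (coarser) level before upsampling, and applies a final linear output layer at full resolution. The function names `mlp_mac_per_sample` and `shared_mac_per_sample` are hypothetical.

```python
from math import prod


def mlp_mac_per_sample(in_dim, hidden, depth, out_dim):
    """MACs for one sample through a plain MLP with `depth` hidden layers,
    every layer evaluated at full resolution (as in a vanilla SIREN)."""
    return in_dim * hidden + (depth - 1) * hidden * hidden + hidden * out_dim


def shared_mac_per_sample(bases, hidden, out_dim=1, in_dim=1):
    """Average MACs per sample under the ASSUMED activation-sharing model.

    bases[i] is the base B_i at level i; the signal has prod(bases) samples.
    """
    n_samples = prod(bases)
    # Layer 0 only sees the B_0 level-0 coordinates.
    total = bases[0] * in_dim * hidden
    cells = bases[0]  # cumulative product C_{i-1} of the bases so far
    for b in bases[1:]:
        # Hidden matmul runs once per coarse cell, before upsampling.
        total += cells * hidden * hidden
        # Modulator projects the B_i level-i coordinates to the hidden width.
        total += b * in_dim * hidden
        cells *= b
    # Assumed output layer, applied to every sample.
    total += n_samples * hidden * out_dim
    return total / n_samples


for depth in (4, 8, 12):
    plain = mlp_mac_per_sample(1, 64, depth, 1)
    shared = shared_mac_per_sample([2] * depth, 64)
    print(f"depth={depth:2d}  plain={plain:8d}  shared={shared:10.1f}")
```

With a uniform base of 2, the hidden-layer terms form a geometric series, so the shared cost per sample stays below roughly one full-resolution hidden layer however many levels are added, while the plain MLP cost grows linearly with depth.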

Figures (12)

  • Figure 1: Overall framework for ASMR. (a) Multi-resolution Coordinates: The original coordinates are decomposed into multiple hierarchical levels, each with its own set of axes. To illustrate repetitive patterns, the coordinates are folded into a higher-dimensional space. (b) Hierarchical Modulation: The number of layers in the model is equal to the number of hierarchical levels. At each level (except level-0), the coordinates are first projected to the hidden dimension via modulators, then added elementwise to the activations of the corresponding layer. (c) Activation-Sharing Inference: The MAC-saving activation-sharing inference is performed by utilizing upsampling operations on both modulations and hidden features. Here, $B_{x_{i}}$ represents the base at level-$i$ along $x$-axis, while $C_{x_{i}}$ denotes the cumulative product of bases along $x$-axis from hierarchical level-0 to level-$i$ (i.e. $B_{x_{i}} \times B_{x_{i-1}} \times \ldots \times B_{x_{0}}$). A uniform base $B_{x_{i}}=2$ is used in this example.
  • Figure 2: MAC-#Params curves of SIREN, ASMR, KiloNeRF and LoE. We highlight that ASMR reduces the MAC of SIREN models of width 256 by $50\sim200\times$ to near the theoretical limit of a single-layer MLP with 32 hidden units (red dotted line).
  • Figure 2: Ultra-low MAC fitting results on Cameraman image.
  • Figure 3: Comparison of MAC-PSNR curves. Circle's area is proportional to #Params.
  • Figure 4: Audio fitting results on the LibriSpeech dataset. The mean $\pm$ std. across all samples is reported.
  • ...and 7 more figures
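
The three steps in the Figure 1 caption can be sketched in a few lines of pure Python for a 1-D signal. This is a minimal illustration under stated assumptions, not the authors' implementation: coordinates are left as raw integer digits (no normalization), biases are omitted, each modulator is a single vector scaling the level coordinate, and the modulation is added before the sine. All function names (`decompose`, `make_params`, `per_sample_forward`, `shared_forward`) are hypothetical.

```python
import math
import random


def decompose(n, bases):
    """Mixed-radix digits of sample index n, coarsest level first."""
    digits = []
    for b in reversed(bases):
        n, d = divmod(n, b)
        digits.append(d)
    return digits[::-1]


def linear(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def make_params(bases, hidden, seed=0):
    """Random toy weights: first layer, hidden layers, modulators."""
    rng = random.Random(seed)

    def vec():
        return [rng.uniform(-1, 1) for _ in range(hidden)]

    W0 = vec()
    Ws = [[vec() for _ in range(hidden)] for _ in bases[1:]]
    Ms = [vec() for _ in bases[1:]]
    return W0, Ws, Ms


def per_sample_forward(n, bases, W0, Ws, Ms):
    """Reference path: run every layer separately for sample n."""
    digits = decompose(n, bases)
    z = [math.sin(w * digits[0]) for w in W0]
    for W, M, x in zip(Ws, Ms, digits[1:]):
        pre = linear(W, z)
        z = [math.sin(p + m * x) for p, m in zip(pre, M)]
    return z


def shared_forward(bases, W0, Ws, Ms):
    """Activation sharing: each matmul runs once per coarse cell."""
    acts = [[math.sin(w * x) for w in W0] for x in range(bases[0])]
    for W, M, b in zip(Ws, Ms, bases[1:]):
        pre = [linear(W, z) for z in acts]             # C_{i-1} matmuls
        mods = [[m * x for m in M] for x in range(b)]  # B_i modulations
        # Repeat each coarse feature b times and tile the modulations,
        # which plays the role of the upsampling step.
        acts = [[math.sin(p + q) for p, q in zip(pz, mx)]
                for pz in pre for mx in mods]
    return acts


bases = [2, 3, 2]
params = make_params(bases, hidden=4)
shared = shared_forward(bases, *params)
max_err = max(
    abs(a - b)
    for n in range(len(shared))
    for a, b in zip(shared[n], per_sample_forward(n, bases, *params))
)
print("samples:", len(shared), "max abs difference:", max_err)
```

Both paths produce the same features for every sample, but the shared path performs each hidden-layer matrix multiply only once per cell of the coarser grid instead of once per output sample.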

Theorems & Definitions (2)

  • Proposition 1
  • Proof