Rethinking the shape convention of an MLP

Meng-Hsi Chen, Yu-Ang Lee, Feng-Ting Liao, Da-shan Shiu

TL;DR

The paper questions the standard narrow–wide–narrow MLP pattern and proposes wide–narrow–wide (Hourglass) blocks where skip connections act at expanded dimensions while residuals pass through a narrow bottleneck. It argues, supported by theory and experiments on MNIST and ImageNet-32, that operating in higher-dimensional latent spaces enhances incremental refinement and parameter efficiency, achievable even with a fixed random input projection. Systematic architectural searches show Hourglass MLPs consistently outperform conventional designs on performance–parameter Pareto frontiers, especially as budgets grow, favoring deeper, narrower bottlenecks with large latent dimensions. The findings suggest broader applicability to Transformers and other residual architectures, potentially enabling more compute-efficient large-scale models and new design paradigms for skip connections.

Abstract

Multi-layer perceptrons (MLPs) conventionally follow a narrow-wide-narrow design where skip connections operate at the input/output dimensions while processing occurs in expanded hidden spaces. We challenge this convention by proposing wide-narrow-wide (Hourglass) MLP blocks where skip connections operate at expanded dimensions while residual computation flows through narrow bottlenecks. This inversion leverages higher-dimensional spaces for incremental refinement while maintaining computational efficiency through parameter-matched designs. Implementing Hourglass MLPs requires an initial projection to lift input signals to expanded dimensions. We propose that this projection can remain fixed at random initialization throughout training, enabling efficient training and inference implementations. We evaluate both architectures on generative tasks over popular image datasets, characterizing performance-parameter Pareto frontiers through systematic architectural search. Results show that Hourglass architectures consistently achieve superior Pareto frontiers compared to conventional designs. As parameter budgets increase, optimal Hourglass configurations favor deeper networks with wider skip connections and narrower bottlenecks, a scaling pattern distinct from conventional MLPs. Our findings suggest reconsidering skip connection placement in modern architectures, with potential applications extending to Transformers and other residual networks.
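The wide-narrow-wide idea described in the abstract can be illustrated with a minimal NumPy sketch of the forward pass. This is not the authors' implementation: the ReLU nonlinearity, the absence of biases and normalization, the Gaussian initialization scale, and all function names (`init_hourglass`, `hourglass_forward`) are assumptions made for illustration only. The one property taken from the text is that the skip connection adds at the wide latent dimension `d_z` while the residual branch passes through the narrow dimension `d_h`, after an input projection `W_in` that would simply be left out of the optimizer if kept fixed.

```python
import numpy as np


def init_hourglass(d_x, d_z, d_h, d_out, num_blocks, seed=0):
    """Create weights for a hypothetical Hourglass MLP (weights only, no biases)."""
    rng = np.random.default_rng(seed)
    params = {
        # Input projection that lifts x to the wide latent dimension d_z.
        # Assumption: if kept fixed, it is drawn once here and never updated.
        "W_in": rng.normal(0.0, 1.0 / np.sqrt(d_x), size=(d_x, d_z)),
        "blocks": [],
        # Output projection that maps the latent back to the task dimension.
        "W_out": rng.normal(0.0, 1.0 / np.sqrt(d_z), size=(d_z, d_out)),
    }
    for _ in range(num_blocks):
        W_down = rng.normal(0.0, 1.0 / np.sqrt(d_z), size=(d_z, d_h))  # wide -> narrow
        W_up = rng.normal(0.0, 1.0 / np.sqrt(d_h), size=(d_h, d_z))    # narrow -> wide
        params["blocks"].append((W_down, W_up))
    return params


def hourglass_forward(params, x):
    """Forward pass: the skip path stays at d_z, the residual path narrows to d_h."""
    z = x @ params["W_in"]
    for W_down, W_up in params["blocks"]:
        h = np.maximum(z @ W_down, 0.0)  # narrow bottleneck; ReLU is an assumption
        z = z + h @ W_up                 # skip connection at the wide dimension
    return z @ params["W_out"]


# Toy usage with small, arbitrary sizes.
params = init_hourglass(d_x=16, d_z=64, d_h=8, d_out=16, num_blocks=3)
x = np.ones((4, 16))
y = hourglass_forward(params, x)
print(y.shape)
```

A conventional narrow-wide-narrow block would instead keep `z` at a small dimension and expand only inside the residual branch. In this sketch the two designs differ only in which of the two dimensions the addition `z + ...` happens at.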

Paper Structure

This paper contains 34 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: (a) Illustration of a wide-narrow-wide MLP block. The two endpoints $z_{i}$ and $z_{i+1}$ have a higher dimensionality than the hidden $h_i$. The skip connection thus connects two high-dimensional endpoints, rather than two low-dimensional ones as in the existing convention. Components that do not depend on dimensionality (e.g., normalization, element-wise nonlinearity) are omitted for clarity. (b) Illustration of a full network whose core is a stack of wide-narrow-wide MLP blocks. An input projection network $W_{\text{in}}$ is required to adapt the input dimensionality of $x$ to the dimensionality of the latent $z$. An output projection network $W_{\text{out}}$ is used to adapt to the desired task.
  • Figure 2: Generative Classification Task on MNIST. (a) Performance–complexity Pareto front. Fronts are searched with each configuration repeated 5 times. "Wide–narrow–wide" MLPs outperform conventional "narrow–wide–narrow" ones. (b) Samples predicted by our proposed Hourglass model.
  • Figure 3: Generative Restoration Task - Denoising. Performance-complexity Pareto fronts on MNIST and ImageNet-32 are searched with each configuration repeated 5 times. Optimal configurations are shown in Table \ref{tab:pareto_arch_denoise}.
  • Figure 4: Generative Restoration Task - Super-resolution. Performance-complexity Pareto fronts on MNIST and ImageNet-32 are searched with each configuration repeated 5 times. Optimal configurations are shown in Table \ref{tab:pareto_arch_sr}.
  • Figure 5: Input projection fixed with a random projection matrix. Comparison between fixed and trainable input projection $W_{\text{in}}$ for Hourglass MLP on ImageNet-32 denoising. We use architecture $(d_z, d_h, L) = (3546, 270, 5)$. The fixed-projection model performs comparably to the trainable one.
  • ...and 2 more figures
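To give a sense of why a fixed input projection matters for the configuration quoted in Figure 5, the following sketch counts weights for $(d_z, d_h, L) = (3546, 270, 5)$. Several things here are assumptions rather than details from the paper: that $L$ is the number of blocks, that each block consists of exactly two weight matrices ($d_z \times d_h$ and $d_h \times d_z$) with no biases or normalization parameters, that the input is a flattened $32 \times 32$ RGB image ($d_x = 3072$), and that the denoising output has the same size as the input. The function name `hourglass_param_count` is hypothetical.

```python
def hourglass_param_count(d_x, d_z, d_h, num_blocks, d_out, train_input_proj=True):
    """Weight-only parameter count for a hypothetical Hourglass MLP.

    Biases and normalization parameters are ignored (an assumption).
    """
    blocks = num_blocks * 2 * d_z * d_h   # one down- and one up-projection per block
    w_in = d_x * d_z                      # input projection W_in
    w_out = d_z * d_out                   # output projection W_out
    trainable = blocks + w_out + (w_in if train_input_proj else 0)
    return {"blocks": blocks, "w_in": w_in, "w_out": w_out, "trainable": trainable}


# Assumed input/output size: a flattened 32x32 RGB image.
d_x = 32 * 32 * 3

with_trainable_w_in = hourglass_param_count(d_x, 3546, 270, 5, d_x, train_input_proj=True)
with_fixed_w_in = hourglass_param_count(d_x, 3546, 270, 5, d_x, train_input_proj=False)

print("block weights:      ", with_trainable_w_in["blocks"])
print("W_in weights:       ", with_trainable_w_in["w_in"])
print("trainable (W_in on):", with_trainable_w_in["trainable"])
print("trainable (W_in off):", with_fixed_w_in["trainable"])
```

Under these assumptions the five blocks together hold 9,574,200 weights, while $W_{\text{in}}$ alone holds 10,893,312, so freezing it would remove more trainable weights than the whole block stack contains. The paper's own accounting may differ if it includes biases or other components.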