NdLinear: Preserving Multi-Dimensional Structure for Parameter-Efficient Neural Networks

Alex Reneau, Jerry Yao-Chieh Hu, Zhongfang Zhuang, Ting-Chun Liu, Xiang He, Judah Goldfeder, Nadav Timor, Allen G Roush, Ravid Shwartz-Ziv

TL;DR

NdLinear proposes a drop-in N-D tensor linear layer that preserves the native multi-dimensional structure by applying sequential, per-mode linear transforms, dramatically reducing parameters and compute compared to flattened layers. The authors prove expressivity preservation under a rank-1 Tucker (Kronecker) parameterization and demonstrate favorable VC-dimension scaling, alongside empirical evidence of strong performance across NLP, time series, tabular, and vision tasks with substantial efficiency gains. A key contribution is NdLinear-LoRA, which achieves up to 9× fewer trainable parameters while matching or exceeding LoRA on reasoning tasks, plus extensive ablations showing robustness to hyperparameters and minimal overhead. The work provides practical deployment guidance, showing NdLinear excels on axis-separable data but may underperform on highly entangled patterns, and it offers a paradigm shift away from flattening toward structure-preserving neural architectures with broad potential impact on on-device and federated learning.

Abstract

In deep learning, processing multidimensional inputs (e.g., images, medical scans, and time series) is an important task that often requires flattening the inputs. We introduce $\mathit{NdLinear}$, a drop-in replacement for linear layers that operates directly on tensors, requiring no flattening. By applying transformations separately along each dimension, NdLinear preserves native data structure while achieving dramatic parameter reductions, often by orders of magnitude, with minimal memory overhead. We prove NdLinear maintains expressivity through structured Tucker decomposition while preserving VC-dimension scaling. Extensive experiments demonstrate NdLinear's capacity to achieve significant parameter reductions with substantial wall-clock efficiency gains and minimal memory overhead. For instance, our $\mathit{NdLinear-LoRA}$ matches or exceeds standard LoRA on language reasoning tasks using up to $9\times$ fewer parameters. Experiments across CNNs, RNNs, Transformers, and MLPs on vision, language, time-series, and tabular tasks consistently demonstrate NdLinear's efficiency gains. While excelling at axis-separable tasks, NdLinear has limitations with entangled spatial interactions. By processing data in its original N-dimensional form, NdLinear provides a theoretically grounded, practical component for building more efficient neural architectures.
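The per-dimension transformation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `nd_linear`, the shapes, and the omission of bias terms are assumptions made for illustration. It applies one weight matrix per non-batch mode, then checks that the result matches a flattened linear layer whose weight is the Kronecker product of the per-mode matrices, which is the structured case the TL;DR refers to.

```python
import numpy as np


def nd_linear(x, weights):
    """Apply one linear map per non-batch mode of x (hypothetical helper, no biases).

    x:       array of shape (batch, d_1, ..., d_n)
    weights: list of n matrices; weights[i] has shape (d_i, h_i)
    returns: array of shape (batch, h_1, ..., h_n)
    """
    if len(weights) != x.ndim - 1:
        raise ValueError("need exactly one weight matrix per non-batch mode")
    out = x
    for i, w in enumerate(weights):
        axis = i + 1  # axis 0 is the batch axis
        # Contract mode `axis` with w; tensordot appends the new axis last,
        # so move it back to the position of the mode it replaced.
        out = np.moveaxis(np.tensordot(out, w, axes=([axis], [0])), -1, axis)
    return out


rng = np.random.default_rng(0)
dims_in, dims_out = (4, 5, 6), (3, 2, 7)  # illustrative sizes only
x = rng.standard_normal((2, *dims_in))
weights = [rng.standard_normal((i, o)) for i, o in zip(dims_in, dims_out)]

y = nd_linear(x, weights)

# Equivalent flattened layer: a single dense matrix built as a Kronecker product.
w_flat = np.kron(np.kron(weights[0], weights[1]), weights[2])
y_flat = x.reshape(2, -1) @ w_flat

factored_params = sum(w.size for w in weights)  # 4*3 + 5*2 + 6*7 = 64
flat_params = w_flat.size                       # 120 * 42 = 5040
print(y.shape, np.allclose(y.reshape(2, -1), y_flat), factored_params, flat_params)
```

In this toy setting the factored layer stores 64 weights where the equivalent dense matrix stores 5040. The dense layer can also represent maps that are not Kronecker products, which is where the entangled-interaction limitation mentioned above comes from.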

Paper Structure

This paper contains 75 sections, 5 theorems, 13 equations, 7 figures, 19 tables, 1 algorithm.

Key Result

Theorem 3.1

An NdLinear network with $P_{\text{nd}} = d(a + b + c)$ parameters for tensor dimensions $(a, b, c)$ and hidden dimension $d$ maintains VC-dimension $\Theta(P_{\text{nd}} \log P_{\text{nd}})$ as $d \to \infty$, matching the scaling of vanilla linear layers with $P_{\text{std}}$ parameters.
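To make the parameter count in the theorem concrete, the following sketch evaluates $P_{\text{nd}} = d(a + b + c)$ for one example shape. The comparison figure is an assumption: the summary does not define $P_{\text{std}}$, so the sketch takes it to be a single dense layer mapping the flattened $a \cdot b \cdot c$ inputs to $d^3$ outputs, with biases ignored. The function names and the example dimensions are hypothetical.

```python
from math import prod


def ndlinear_params(dims, d):
    """P_nd = d * (a + b + c): one (dim x d) weight matrix per mode, no biases."""
    return d * sum(dims)


def flattened_params(dims, d):
    """Assumed dense baseline: flatten all modes, map to d**len(dims) outputs."""
    return prod(dims) * d ** len(dims)


dims, d = (32, 32, 3), 16  # illustrative image-like tensor and hidden size
p_nd = ndlinear_params(dims, d)      # 16 * (32 + 32 + 3) = 1072
p_flat = flattened_params(dims, d)   # 3072 * 4096 = 12582912
print(p_nd, p_flat, p_flat // p_nd)
```

Under these assumptions the per-mode count grows with the sum of the dimensions while the flattened count grows with their product, which is the source of the orders-of-magnitude reductions claimed in the abstract.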

Figures (7)

  • Figure 1: NdLinear excels on separable tasks but struggles with entangled patterns. Performance comparison as task structure varies from purely separable ($\alpha=0$) to fully entangled ($\alpha=1$). The crossover at $\alpha \approx 0.45$ indicates that NdLinear outperforms dense MLPs when tasks have less than 45% entanglement. This provides clear deployment guidance: use NdLinear for axis-aligned domains (spectrograms, time series, tabular data) but prefer dense layers for spatially entangled tasks (dense vision, XOR-like patterns).
  • Figure 2: Training and evaluation loss curves during OPT model pretraining. NdLinear variants consistently achieve lower loss values. The x-axis shows the number of training steps.
  • Figure 3: NdLinear's efficiency. Reduced ViT model parameter counts on CIFAR-10 and CIFAR-100 for a distillation task.
  • Figure 4: DiT achieving lower (better) FID scores for image generation on ImageNet-100 when trained from scratch with comparable parameters.
  • Figure 5: $\alpha = 0.1$ (hard): Narrow bump, very challenging.
  • ...and 2 more figures

Theorems & Definitions (8)

  • Theorem 3.1: Informal; see the proofs appendix for the formal statement
  • Theorem C.1: VC-Dimension of NdLinear
  • proof
  • Theorem C.2: Parameter Count Lower Bound
  • proof
  • Proposition C.1: Peak Memory Overhead Bound
  • proof
  • Proposition C.2: Exact FLOP Count