Structured Multidimensional Representation Learning for Large Language Models

Alaa El Ichi, Khalide Jbilou, Mohamed El Guide, Franck Dufrenois

TL;DR

This work introduces a structured spectral factorization of the embedding space based on the L-product for third-order tensors, resulting in a Tensor Transformer architecture that decomposes the encoder into p independent spectral sub-transformers while preserving standard Transformer semantics.

Abstract

Transformer architectures achieve state-of-the-art performance across a wide range of pattern recognition and natural language processing tasks, but their scaling is accompanied by substantial parameter growth and redundancy in the embedding dimension. In this work, we introduce a structured spectral factorization of the embedding space based on the L-product for third-order tensors. By reshaping token representations into spectral tensor slices and performing attention and feed-forward operations in the transform domain, we obtain a Tensor Transformer architecture that decomposes the encoder into p independent spectral sub-transformers while preserving standard Transformer semantics. We prove that the proposed L-Transformer is spectrally equivalent to p parallel Transformers operating on reduced-dimensional embeddings, which reduces encoder parameters to approximately 1/p of the baseline count (up to lower-order terms such as biases and normalization parameters) under fixed total embedding size. When instantiated with a real-valued Discrete Cosine Transform (DCT), the method remains fully differentiable and compatible with existing training pipelines. Beyond compression, the spectral decomposition introduces an inductive bias over embedding frequencies, enabling slice-dependent frequency scaling that improves generalization. Experiments on IMDB and AG News show that the proposed model can substantially reduce encoder parameters (up to 75% for p=4) while maintaining competitive accuracy. On IMDB, the tensorized encoder matches or improves upon the standard baseline under compression, whereas on AG News at moderate width we observe a small accuracy decrease in exchange for a 4× encoder reduction; at BERT-base width (d=768), performance returns to parity.
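The parameter claim in the abstract can be checked with a back-of-the-envelope count. The sketch below is illustrative, not the paper's accounting: it assumes a conventional encoder layer (four attention projections, a two-layer feed-forward block, two layer norms) and assumes the feed-forward width is split across slices as d_ff/p. The function names and the value d_ff=3072 are hypothetical.

```python
def encoder_layer_params(d, d_ff):
    """Parameter count of one conventional encoder layer (assumed layout)."""
    weights = 4 * d * d + 2 * d * d_ff   # Q, K, V, output projections + two FFN matrices
    biases = 4 * d + d_ff + d            # projection biases + two FFN biases
    norms = 2 * 2 * d                    # two layer norms, each with scale and shift
    return weights, biases, norms


def tensor_layer_params(d, d_ff, p):
    """p independent sub-layers of width d/p (assumption: FFN width is d_ff/p)."""
    assert d % p == 0 and d_ff % p == 0
    w, b, n = encoder_layer_params(d // p, d_ff // p)
    return p * w, p * b, p * n


d, d_ff, p = 768, 3072, 4
std = encoder_layer_params(d, d_ff)
ten = tensor_layer_params(d, d_ff, p)
ratio = sum(ten) / sum(std)
print(f"standard: {sum(std)}, tensorized: {sum(ten)}, ratio: {ratio:.4f}")
```

Under these assumptions the weight matrices shrink by exactly a factor of p, while biases and normalization parameters keep their size, which is why the overall ratio is only approximately 1/p.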

Paper Structure (65 sections, 5 theorems, 50 equations, 7 figures, 16 tables, 2 algorithms)

Key Result

Theorem 4.7

Let $\mathcal{A}\in\mathbb{R}^{m\times n\times p}$. Then there exist $\mathcal{L}$-orthogonal tensors $\mathcal{U}\in\mathbb{R}^{m\times m\times p}$ and $\mathcal{V}\in\mathbb{R}^{n\times n\times p}$, and an f-diagonal tensor $\mathcal{S}\in\mathbb{R}^{m\times n\times p}$ such that ... The diagonal tubes $\mathcal{S}(i,i,:)$ are called singular tubes; their $\ell_2$-norms play the role ...

Figures (7)

  • Figure 5.1: Tensorization of token embeddings. The matrix $X\in\mathbb{R}^{T\times d}$ is reshaped into $\mathcal{X}\in\mathbb{R}^{T\times d_s\times p}$ with $d_s=d/p$ by splitting the feature dimension into $p$ blocks of width $d_s$. The third mode ($p$) is the tube dimension used by the $\mathcal{L}$-product; applying $\mathcal{L}$ along mode-3 produces $p$ transform-domain frontal slices $\widehat{X}^{(k)}\in\mathbb{R}^{T\times d_s}$.
  • Figure 6.1: Encoder parameters versus test accuracy on IMDB (left) and AG News (right). Error bars show $\pm 1$ std over 3 seeds. Tensor models achieve competitive or superior accuracy while using approximately $4\times$ fewer encoder parameters at $p=4$.
  • Figure 6.2: PE strategy comparison on IMDB at $p=4$ (mean $\pm$ std, 3 seeds). The dashed line shows the full-width Std baseline. All strategies outperform Std, and the total spread across strategies is 0.49 pp.
  • Figure 6.3: Accuracy gap (best T4 minus Std) as a function of model width $d$. The shaded band indicates a $\pm 0.3$ pp parity region. At $d=768$, the tensor model reaches parity while compressing the encoder by $4\times$.
  • Figure A.1: Slice-wise equivalence of the tensor Transformer under the $\mathcal{L}$-product framework. The tensor is transformed along mode-3, processed independently slice-by-slice by $p$ compact Transformers (dimension $d_s$), stacked back, and mapped to the original domain via $\mathcal{L}^{-1}$.
  • ...and 2 more figures
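The tensorization in Figure 5.1 and the round trip through the transform domain in Figure A.1 can be sketched in NumPy. This is a minimal sketch under stated assumptions: the transform is taken to be an orthonormal DCT-II built by hand, and the helper names (`dct_matrix`, `tensorize`, `detensorize`, `mode3_transform`) are hypothetical, not the authors' code.

```python
import numpy as np


def dct_matrix(p):
    """Orthonormal DCT-II matrix of size p x p (rows are basis vectors)."""
    n = np.arange(p)
    k = n.reshape(-1, 1)
    M = np.sqrt(2.0 / p) * np.cos(np.pi * (2 * n + 1) * k / (2 * p))
    M[0, :] = np.sqrt(1.0 / p)  # rescale the constant row so that M @ M.T = I
    return M


def tensorize(X, p):
    """Reshape X (T x d) into a T x d_s x p tensor; feature block k becomes slice k."""
    T, d = X.shape
    assert d % p == 0
    return X.reshape(T, p, d // p).transpose(0, 2, 1)


def detensorize(X_t):
    """Inverse of tensorize: concatenate the p slices back along the feature axis."""
    T, d_s, p = X_t.shape
    return X_t.transpose(0, 2, 1).reshape(T, p * d_s)


def mode3_transform(X_t, M):
    """Apply M along mode 3: out[:, :, l] = sum_k M[l, k] * X_t[:, :, k]."""
    return np.einsum("tjk,lk->tjl", X_t, M)


T, d, p = 5, 8, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d))
M = dct_matrix(p)

X_t = tensorize(X, p)                  # T x d_s x p
X_hat = mode3_transform(X_t, M)        # p transform-domain slices X_hat[:, :, k]
X_back = detensorize(mode3_transform(X_hat, M.T))  # inverse transform is M.T
print(X_t.shape, np.allclose(X_back, X))
```

In the architecture described above, each slice `X_hat[:, :, k]` would be fed to its own compact Transformer of width d_s before the inverse transform; that step is omitted here.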

Theorems & Definitions (32)

  • Definition 3.1: $n$-mode product
  • Remark 1
  • Definition 4.1: $\mathcal{L}$-transform
  • Definition 4.2: Facewise product
  • Definition 4.3: $\mathcal{L}$-product
  • Definition 4.4: $\mathcal{L}$-identity tensor
  • Definition 4.5: $\mathcal{L}$-transpose
  • Definition 4.6: Structured tensors under the $\mathcal{L}$-product
  • Theorem 4.7: $\mathcal{L}$-SVD
  • Remark 2: Computation
  • ...and 22 more
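The definitions listed above are not reproduced on this page, so the following sketch falls back on the formulation commonly used for transform-based tensor products: $\mathcal{A} *_{\mathcal{L}} \mathcal{B} = \mathcal{L}^{-1}\big(\mathcal{L}(\mathcal{A}) \,\triangle\, \mathcal{L}(\mathcal{B})\big)$, with $\triangle$ the facewise product. It assumes an orthonormal DCT-II as the transform and computes an $\mathcal{L}$-SVD by taking a matrix SVD of each transform-domain slice. All function names are hypothetical, and the paper's exact definitions may differ in detail.

```python
import numpy as np


def dct_matrix(p):
    """Orthonormal DCT-II matrix (assumed choice of the transform L)."""
    n = np.arange(p)
    M = np.sqrt(2.0 / p) * np.cos(np.pi * (2 * n + 1) * n.reshape(-1, 1) / (2 * p))
    M[0, :] = np.sqrt(1.0 / p)
    return M


def L(A, M):
    """Apply M along mode 3 of a third-order tensor."""
    return np.einsum("ijk,lk->ijl", A, M)


def facewise(A_hat, B_hat):
    """Facewise product: multiply matching frontal slices."""
    return np.einsum("ijk,jlk->ilk", A_hat, B_hat)


def l_product(A, B, M):
    """L-product: transform, multiply slice by slice, transform back."""
    return L(facewise(L(A, M), L(B, M)), M.T)


def l_transpose(A, M):
    """L-transpose: transpose each transform-domain slice."""
    return L(L(A, M).transpose(1, 0, 2), M.T)


def l_svd(A, M):
    """Slice-wise SVD in the transform domain, mapped back to the original domain."""
    m, n, p = A.shape
    A_hat = L(A, M)
    U_hat = np.empty((m, m, p))
    S_hat = np.zeros((m, n, p))
    V_hat = np.empty((n, n, p))
    for k in range(p):
        u, s, vt = np.linalg.svd(A_hat[:, :, k])
        U_hat[:, :, k] = u
        S_hat[: len(s), : len(s), k] = np.diag(s)
        V_hat[:, :, k] = vt.T
    return L(U_hat, M.T), L(S_hat, M.T), L(V_hat, M.T)


m, n, p = 4, 3, 4
M = dct_matrix(p)
A = np.random.default_rng(1).standard_normal((m, n, p))
U, S, V = l_svd(A, M)

A_rec = l_product(l_product(U, S, M), l_transpose(V, M), M)
I_m = L(np.repeat(np.eye(m)[:, :, None], p, axis=2), M.T)  # L-identity tensor
print(np.allclose(A_rec, A), np.allclose(l_product(l_transpose(U, M), U, M), I_m))
```

Because every operation reduces to independent matrix operations on the p transform-domain slices, the same pattern underlies the slice-wise equivalence shown in Figure A.1.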