Structured Multidimensional Representation Learning for Large Language Models

Alaa El Ichi; Khalide Jbilou; Mohamed El Guide; Franck Dufrenois

TL;DR

This work introduces a structured spectral factorization of the embedding space based on the L-product for third-order tensors, resulting in a Tensor Transformer architecture that decomposes the encoder into p independent spectral sub-transformers while preserving standard Transformer semantics.

Abstract

Transformer architectures achieve state-of-the-art performance across a wide range of pattern recognition and natural language processing tasks, but their scaling is accompanied by substantial parameter growth and redundancy in the embedding dimension. In this work, we introduce a structured spectral factorization of the embedding space based on the L-product for third-order tensors. By reshaping token representations into spectral tensor slices and performing attention and feed-forward operations in the transform domain, we obtain a Tensor Transformer architecture that decomposes the encoder into p independent spectral sub-transformers while preserving standard Transformer semantics. We prove that the proposed L-Transformer is spectrally equivalent to p parallel Transformers operating on reduced-dimensional embeddings, which reduces encoder parameters by a factor of approximately p (up to lower-order terms such as biases and normalization parameters) under a fixed total embedding size. When instantiated with a real-valued Discrete Cosine Transform (DCT), the method remains fully differentiable and compatible with existing training pipelines. Beyond compression, the spectral decomposition introduces an inductive bias over embedding frequencies, enabling slice-dependent frequency scaling that improves generalization. Experiments on IMDB and AG News show that the proposed model can substantially reduce encoder parameters (up to 75% for p=4) while maintaining competitive accuracy. On IMDB, the tensorized encoder matches or improves upon the standard baseline under compression, whereas on AG News at moderate width we observe a small accuracy decrease in exchange for a 4× reduction in encoder parameters; at BERT-base width (d=768), performance returns to parity.
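
The parameter saving claimed in the abstract can be checked with a short back-of-the-envelope count. The sketch below is an illustration, not the paper's own accounting: it assumes each spectral sub-transformer is a standard encoder layer (four attention projections, a two-layer feed-forward block with hidden width `ff_mult * d`, two LayerNorms), and the function names and the `ff_mult=4` default are hypothetical choices.

```python
def encoder_layer_params(d, ff_mult=4):
    """Parameter count of one standard encoder layer of width d (assumed layout)."""
    d_ff = ff_mult * d
    attn = 4 * d * d + 4 * d        # Q, K, V, output projections plus biases
    ffn = 2 * d * d_ff + d_ff + d   # two linear layers plus biases
    norms = 2 * 2 * d               # two LayerNorms, scale and shift each
    return attn + ffn + norms


def tensor_layer_params(d, p, ff_mult=4):
    """p independent sub-layers, each of width d_s = d / p."""
    assert d % p == 0, "embedding size must be divisible by p"
    return p * encoder_layer_params(d // p, ff_mult)


d, p = 768, 4                       # BERT-base width and p=4 from the abstract
std = encoder_layer_params(d)
ten = tensor_layer_params(d, p)
ratio = ten / std                   # close to 1/p; biases and norms do not shrink
print(std, ten, round(ratio, 4))
```

Under these assumptions the weight matrices shrink by exactly a factor of p (quadratic terms scale as p · (d/p)² = d²/p), while the bias and normalization terms are linear in d and stay the same size, which is why the ratio sits slightly above 1/p.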

Paper Structure (65 sections, 5 theorems, 50 equations, 7 figures, 16 tables, 2 algorithms)

Introduction
Related Work
Positioning of the proposed L-product tensorization among efficient Transformer methods
Preliminaries and notation
Tensors, entries, slices, and fibers
Matricization (unfolding)
$n$-mode product
The $\mathcal{L}$-Product
$\mathcal{L}$-transform and facewise multiplication
Identity, transpose, orthogonality, and invertibility
The $\mathcal{L}$-SVD and ranks
Truncated $\mathcal{L}$-SVD (spectral truncation)
Special cases: t-product and DCT-based products
Tensor LLMs under the $\mathcal{L}$-product framework
Tensorization of token embeddings
...and 50 more sections

Key Result

Theorem 4.7

Let $\mathcal{A}\in\mathbb{R}^{m\times n\times p}$. Then there exist $\mathcal{L}$-orthogonal tensors $\mathcal{U}\in\mathbb{R}^{m\times m\times p}$ and $\mathcal{V}\in\mathbb{R}^{n\times n\times p}$, and an f-diagonal tensor $\mathcal{S}\in\mathbb{R}^{m\times n\times p}$ such that … The diagonal tubes $\mathcal{S}(i,i,:)$ are called singular tubes; their $\ell_2$-norms play the role …
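
The theorem statement on this page is cut off, so the following is only a sketch of how such a factorization is commonly computed in the tensor literature, not the paper's algorithm. It assumes the usual form $\mathcal{A}=\mathcal{U}\star_{\mathcal{L}}\mathcal{S}\star_{\mathcal{L}}\mathcal{V}^{\top}$, takes $\mathcal{L}$ to be an orthonormal DCT-II along mode 3, and computes one matrix SVD per transform-domain frontal slice. All function names are hypothetical.

```python
import numpy as np


def dct_matrix(p):
    """Orthonormal DCT-II matrix M (p x p), so that M @ M.T is the identity."""
    k = np.arange(p)[:, None]
    n = np.arange(p)[None, :]
    M = np.sqrt(2.0 / p) * np.cos(np.pi * (2 * n + 1) * k / (2 * p))
    M[0, :] /= np.sqrt(2.0)
    return M


def l_transform(A, M):
    """Apply M along mode 3: each tube A[i, j, :] is mapped to M @ A[i, j, :]."""
    return np.einsum("kl,ijl->ijk", M, A)


def l_product(A, B, M):
    """Transform, multiply frontal slices facewise, transform back (M orthogonal)."""
    C_hat = np.einsum("ijk,jlk->ilk", l_transform(A, M), l_transform(B, M))
    return l_transform(C_hat, M.T)


def l_transpose(A):
    """For a real transform, transposing every frontal slice is sufficient."""
    return np.transpose(A, (1, 0, 2))


def l_svd(A, M):
    """Slice-wise SVD in the transform domain, mapped back to the original domain."""
    m, n, p = A.shape
    r = min(m, n)
    A_hat = l_transform(A, M)
    U_hat = np.empty((m, m, p))
    S_hat = np.zeros((m, n, p))
    V_hat = np.empty((n, n, p))
    for k in range(p):
        U, s, Vt = np.linalg.svd(A_hat[:, :, k], full_matrices=True)
        U_hat[:, :, k] = U
        S_hat[np.arange(r), np.arange(r), k] = s   # f-diagonal: only tubes (i, i, :)
        V_hat[:, :, k] = Vt.T
    return tuple(l_transform(T, M.T) for T in (U_hat, S_hat, V_hat))


rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3, 4))
M = dct_matrix(4)
U, S, V = l_svd(A, M)
A_rec = l_product(l_product(U, S, M), l_transpose(V), M)
print(np.allclose(A_rec, A))
```

Because the DCT matrix is real and orthogonal, the inverse transform is just its transpose, and the `p` slice-wise SVDs are independent of each other.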

Figures (7)

Figure 5.1: Tensorization of token embeddings. The matrix $X\in\mathbb{R}^{T\times d}$ is reshaped into $\mathcal{X}\in\mathbb{R}^{T\times d_s\times p}$ with $d_s=d/p$ by splitting the feature dimension into $p$ blocks of width $d_s$. The third mode ($p$) is the tube dimension used by the $\mathcal{L}$-product; applying $\mathcal{L}$ along mode-3 produces $p$ transform-domain frontal slices $\widehat{X}^{(k)}\in\mathbb{R}^{T\times d_s}$.
Figure 6.1: Encoder parameters versus test accuracy on IMDB (left) and AG News (right). Error bars show $\pm 1$ std over 3 seeds. Tensor models achieve competitive or superior accuracy while using approximately $4\times$ fewer encoder parameters at $p=4$.
Figure 6.2: PE strategy comparison on IMDB at $p=4$ (mean $\pm$ std, 3 seeds). The dashed line shows the full-width Std baseline. All strategies outperform Std, and the total spread across strategies is 0.49 pp.
Figure 6.3: Accuracy gap (best T4 minus Std) as a function of model width $d$. The shaded band indicates a $\pm 0.3$ pp parity region. At $d=768$, the tensor model reaches parity while compressing the encoder by $4\times$.
Figure A.1: Slice-wise equivalence of the tensor Transformer under the $\mathcal{L}$-product framework. The tensor is transformed along mode-3, processed independently slice-by-slice by $p$ compact Transformers (dimension $d_s$), stacked back, and mapped to the original domain via $\mathcal{L}^{-1}$.
...and 2 more figures
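
The tensorization in the Figure 5.1 caption and the slice-wise pipeline in the Figure A.1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the transform is an orthonormal DCT-II along mode 3, `slice_fns` is a hypothetical list of `p` callables standing in for the compact Transformers of width $d_s$, and none of the names come from the paper.

```python
import numpy as np


def dct_matrix(p):
    """Orthonormal DCT-II matrix M (p x p), so that M @ M.T is the identity."""
    k = np.arange(p)[:, None]
    n = np.arange(p)[None, :]
    M = np.sqrt(2.0 / p) * np.cos(np.pi * (2 * n + 1) * k / (2 * p))
    M[0, :] /= np.sqrt(2.0)
    return M


def tensorize(X, p):
    """Reshape X (T x d) into (T x d_s x p): slice k holds feature block k."""
    T, d = X.shape
    assert d % p == 0, "embedding size must be divisible by p"
    return X.reshape(T, p, d // p).transpose(0, 2, 1)


def untensorize(X3):
    """Inverse of tensorize: (T x d_s x p) back to (T x d)."""
    T, d_s, p = X3.shape
    return X3.transpose(0, 2, 1).reshape(T, p * d_s)


def slicewise_apply(X, p, slice_fns):
    """Transform along mode 3, process each slice independently, transform back."""
    M = dct_matrix(p)
    X_hat = np.einsum("kl,tjl->tjk", M, tensorize(X, p))
    Y_hat = np.stack([slice_fns[k](X_hat[:, :, k]) for k in range(p)], axis=2)
    Y = np.einsum("lk,tjl->tjk", M, Y_hat)   # applies M.T, the inverse transform
    return untensorize(Y)


T, d, p = 6, 8, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d))
X3 = tensorize(X, p)                               # shape (6, 2, 4)
Y_id = slicewise_apply(X, p, [lambda Z: Z] * p)    # identity slices: round trip
print(X3.shape, np.allclose(Y_id, X))
```

With identity slice functions the pipeline returns the input unchanged, which confirms that the reshape and the transform are both invertible; swapping in `p` separately parameterized maps gives the slice-independent structure described in the captions.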

Theorems & Definitions (32)

Definition 3.1: $n$-mode product
remark 1
Definition 4.1: $\mathcal{L}$-transform
Definition 4.2: Facewise product
Definition 4.3: $\mathcal{L}$-product
Definition 4.4: $\mathcal{L}$-identity tensor
Definition 4.5: $\mathcal{L}$-transpose
Definition 4.6: Structured tensors under the $\mathcal{L}$-product
Theorem 4.7: $\mathcal{L}$-SVD
remark 2: Computation
...and 22 more
