PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference

Burc Gokden

TL;DR

This work introduces PLDR-LLM, a foundational model whose deductive outputs are captured by a tensor trio $(A_{LM}, A_{P}, G_{LM})$ that defines the attention mechanism. It demonstrates that a learned, input-invariant tensor operator $G_{LM}$ can replace the deep PLGA network at inference, which enables straightforward caching strategies while keeping the inductive outputs nearly identical, up to small perturbations. The study provides extensive ablations comparing learnable, predefined, and random tensor operators, shows that SDPA is a special case in which $G_{LM}$ is the identity, and reports competitive zero-shot benchmarks alongside substantial inference-time speedups from KV-cache and G-cache. The results imply a fundamental training–inference asymmetry and suggest that the learned singularity of the deductive outputs yields a robust, generalizable operator that can serve as a cache-friendly backbone for future language models, with practical implications for efficient inference in large-scale deployments.

Abstract

We show that the Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a foundational model whose deductive outputs are invariant tensors up to a small perturbation. PLDR-LLM learns a singularity condition for the deductive outputs that enables the once-inferred energy-curvature tensor $\mathbf{G}_{LM}$ to replace the deep neural network of power law graph attention (PLGA) that generates the deductive outputs at inference. We demonstrate that a cache for $\mathbf{G}_{LM}$ (G-cache) and a KV-cache can be implemented in a straightforward manner to improve inference time. The invariance and generalizable nature of the deductive outputs hold at very high fidelity: the deductive outputs have the same RMSE and determinant values up to 15 decimal places after caching, and zero-shot benchmark scores remain unchanged. Ablation studies show that learned deductive outputs have loss and accuracy characteristics distinct from those of models pretrained with transferred, randomly initialized, or identity tensors as a constant tensor operator. They also show that an LLM with scaled dot-product attention (SDPA) is a special case of PLDR-LLM in which $\mathbf{G}_{LM}$ is predefined as the identity. The observed invariance characteristic introduces a novel asymmetry between the training and inference phases with caching. We outline the common characteristics observed in the deductive outputs for the learned singularity condition. We provide an implementation of a training and inference framework for PLDR-LLM with KV-cache and G-cache.
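
The two claims above about the identity special case and the G-cache can be illustrated with a small NumPy sketch. It rests on assumptions that are not stated in this summary: the attention scores are taken to have the form $Q\,\mathbf{G}_{LM}K^{T}/\sqrt{d_k}$, and the names `attention_with_operator`, `sdpa`, and `GCache` are hypothetical helpers, not part of the released framework. The real PLGA computation may differ in its details.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_operator(q, k, v, g):
    # ASSUMED form: scores = q @ g @ k^T / sqrt(d_k), where g is a
    # (d_k x d_k) tensor operator that does not depend on the input.
    d_k = q.shape[-1]
    scores = q @ g @ k.T / np.sqrt(d_k)
    return softmax(scores) @ v


def sdpa(q, k, v):
    # Standard scaled dot-product attention (no mask, single head).
    d_k = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v


class GCache:
    """Hypothetical cache: infer the operator once, then reuse it."""

    def __init__(self, compute_g):
        self._compute_g = compute_g  # stand-in for the deep network
        self._g = None
        self.calls = 0  # how many times the network was actually run

    def get(self, x):
        if self._g is None:
            self._g = self._compute_g(x)
            self.calls += 1
        return self._g


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))

# With g set to the identity, the operator form reduces to SDPA.
out_identity = attention_with_operator(q, k, v, np.eye(8))
out_sdpa = sdpa(q, k, v)

# The cache runs the (stand-in) network only on the first request.
cache = GCache(lambda x: np.eye(x.shape[-1]))
g_first = cache.get(q)
g_second = cache.get(rng.standard_normal((4, 8)))
out_cached = attention_with_operator(q, k, v, g_second)
```

In this sketch the second call to `cache.get` returns the stored operator without invoking the stand-in network again, which mirrors the idea that a once-inferred $\mathbf{G}_{LM}$ can be reused across inputs at inference.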

Paper Structure

This paper contains 13 sections, 2 equations, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Train and validation loss/accuracy curves for PLDR-LLMs in Table 7. Train loss is captured as a running loss at every 2000 steps. Validation loss is measured at every 12000 steps using 2000 batches/rank from part of RefinedWeb dataset that is not used in pretraining.