PLDR-LLM: Large Language Model from Power Law Decoder Representations

Burc Gokden

TL;DR

It is shown that deductive outputs of PLDR-LLMs can be used to compare model characteristics or improve the performance by introducing the Directed Acyclic Graph (DAG) loss as a metric and regularizer.

Abstract

We present the Large Language Model from Power Law Decoder Representations (PLDR-LLM), a language model that leverages non-linear and linear transformations through the Power Law Graph Attention mechanism to generate well-defined deductive and inductive outputs. We pretrain PLDR-LLMs of varying layer sizes with a small batch size of 32 and $\sim$8B tokens from the RefinedWeb dataset, and show that they achieve competitive performance in zero-shot and few-shot settings compared to scaled dot-product LLMs of similar model size reported in the literature. We show that deductive outputs of PLDR-LLMs can be used to compare model characteristics or improve performance by introducing the Directed Acyclic Graph (DAG) loss as a metric and regularizer. Our results indicate that the initial maximum learning rate and warm-up steps have a lasting impact on deductive outputs throughout pretraining. We provide a detailed description of the PLDR-LLM architecture, its implementation and the pretraining procedure.

Paper Structure

This paper contains 11 sections, 2 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Train and validation loss/accuracy curves for PLDR-LLMs. Train loss is captured as a running loss every 2000 steps. Validation loss is measured every 12000 steps using 2000 batches/rank from a part of the RefinedWeb dataset that is not used in pretraining.
  • Figure 2: DAG regularizer log loss trend during pretraining before scaling with regularizer coefficients: (a) for ${\bm{\mathsfit{A}}}_{LM}$, (b) for ${\bm{\mathsfit{A}}}_{\textbf{P}}$ and ${\bm{\mathsfit{G}}}_{LM}$. For unregularized models, DAG loss for ${\bm{\mathsfit{A}}}_{LM}$ overflows for a few thousand steps after warm-up. For regularized models, DAG loss for ${\bm{\mathsfit{A}}}_{LM}$ goes to zero quickly. Overflow and zero values are omitted on the log scale axis for the loss.
  • Figure 3: (a)-(d) Train and validation loss/accuracy curves and (e) DAG loss for ${\bm{\mathsfit{A}}}_{LM}$ for ablation of PLDR-LLMs with a low learning rate, longer warm-up steps and a different tokenizer model. The discontinuities in the curves are due to overflow of the DAG loss value.
  • Figure 4: PLDR-LLM model and multihead attention diagrams for PLDRv5 and PLDRv9 designs. PLDRv9 differs only in the resizing of layers before and after residual networks for the metric learner. The feedforward network (FFN) is composed of SwiGLU and Linear layers.