Towards Signal Processing In Large Language Models

Prateek Verma; Mert Pilanci

TL;DR

The paper addresses the lack of explicit signal-processing mechanisms inside LLMs by proposing learnable time-frequency representations of intermediate activations, which are filtered and reconstructed under a causal framework. It introduces a learnable front end implemented as a 2-layer CNN with $M=144$ filters and extends it to multi-scale filters, with optional token-adaptive weighting via a small Transformer. All components are trained end-to-end through next-token prediction and evaluated on text-8 and on non-causal audio tasks. Key findings are faster convergence and improved performance with only a tiny parameter overhead, plus interpretable learned kernels that reveal how embeddings traverse the latent space. The work suggests a new paradigm for integrating signal processing into neural architectures, with potential cross-domain benefits for efficiency and interpretability in generative models and beyond.

Abstract

This paper introduces the idea of applying signal processing inside a Large Language Model (LLM). With the recent explosion of generative AI, our work can help bridge two fields: signal processing and large language models. We draw parallels between the classical Fourier transform and Fourier-transform-like learnable time-frequency representations for every intermediate activation signal of an LLM. Once we decompose every activation signal across tokens into a time-frequency representation, we learn how to filter and reconstruct it, with all components learned from scratch, to predict the next token given the previous context. We show that for GPT-like architectures, our work achieves faster convergence and significantly increases performance while adding only a minuscule number of extra parameters, when trained for the same number of epochs. We hope this work paves the way for algorithms that explore signal processing on the signals found inside neural architectures like LLMs and beyond.

Paper Structure (17 sections, 4 equations, 3 figures, 1 table)

This paper contains 17 sections, 4 equations, 3 figures, 1 table.

Introduction and Related Work
Dataset
Background
Fourier Transform Preliminaries
Learnable Time-Frequency Representation
Short Time Fourier Transform As Filter-Bank
Filtering
Methodology
Finding Signals Inside Large Language Models
Learnable Time-Frequency Representation And Filtering Over Signals
Multi-Scale Time-Frequency Representation And Filtering Over Signals
Making Weights Adaptive Across Tokens
Results And Discussion
Performance In Non-Causal Setups
Performance for Filtering Approaches And Speedups
...and 2 more sections

Figures (3)

Figure 1: (A) Incorporating signal processing inside a Large Language Model. We treat the activations between decoder layers, taken along the token dimension, as 1-D signals. Each of these signals is decomposed into a time-frequency representation, filtered, and added back, akin to a residual block. (B) Path interpretation: our method can be read as filtering the path that the intermediate embeddings of text tokens trace through the high-dimensional latent space. The embeddings with black markers follow a circular path and are constrained to certain regions. We design filters that learn these paths and constrain them or shape the trajectory. (C) We compute a time-frequency representation of each signal via 1-D causal filters. Each decomposed signal, after a ReLU non-linearity, is amplified or suppressed by learned weights and added back to the original signal to give the filtered signal. (D) Figure from sainath2015learning. We draw inspiration from speech recognition work that learns a mel-spectrogram-like time-frequency representation from scratch using convolutional filters rather than a Fourier-based representation.
Figure 2: (C) We draw inspiration from a classic signal processing pipeline to bring advanced filtering ideas to large language models. We take each of the original signals and decompose it into a time-frequency representation. We then learn a time-frequency mask on the learned representation, which lets us learn time-varying filters as opposed to filters that are fixed across the token dimension. (A) We draw inspiration from the source separation pipeline, where spectrograms are computed and time-frequency masks are learned that multiply the spectrogram representation to suppress the signals we are not interested in. Figure from reghunath2023predominant. (B) TASNET, as shown in luo2019conv, proposed an architecture in which a 1-D convolutional filter learns to decompose a signal and a series of convolutional filters learns a mask that turns the decomposed components on or off; the two are multiplied together to retain the signal components of interest. We can see parallels to our work in (C), where a Transformer decoder learns a t-f mask that lets the weights of the filter operating on the learned t-f representation vary across the token/context dimension.
Figure 3: We compare the performance of our proposed architecture, which does signal processing on intermediate embeddings, with that of the baseline model on text-8. We achieve 40-45% faster convergence. When we train for the same duration, we see close to a 0.02 improvement in validation loss. For the LLM trained on text-8, we also show what the filters look like for the learned multi-scale time-frequency representation. The basis functions are not sinusoidal, supporting our hypothesis that we need to learn a time-frequency representation from scratch, one that is optimal for the signals in the latent space between decoder layers. This plot shows the kernels learned after the first decoder layer.
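The filtering step described in Figure 1 (C), and the token-varying mask described in Figure 2 (C), can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes a single layer of causal kernels (the paper's front end is a 2-layer CNN), assumes that the weighted sub-bands are summed before being added back, and takes the mask as a given input rather than producing it with a Transformer decoder. The names `causal_conv1d`, `filter_signal`, `filter_signal_masked`, `kernels`, `gains`, and `mask` are hypothetical.

```python
def causal_conv1d(x, kernel):
    """y[t] = sum_k kernel[k] * x[t - k], treating x[t] as 0 for t < 0.

    Output at token t depends only on tokens <= t (no look-ahead).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * x[t - k]
        out.append(acc)
    return out


def decompose(x, kernels):
    """Filter-bank decomposition: one ReLU'd sub-band per kernel."""
    return [[max(0.0, v) for v in causal_conv1d(x, kernel)] for kernel in kernels]


def filter_signal(x, kernels, gains):
    """Fixed weights: one gain per sub-band, shared across all tokens.

    Assumption: weighted sub-bands are summed and added back to x (residual).
    """
    bands = decompose(x, kernels)
    return [
        x[t] + sum(g * band[t] for g, band in zip(gains, bands))
        for t in range(len(x))
    ]


def filter_signal_masked(x, kernels, mask):
    """Token-adaptive weights: mask[m][t] scales sub-band m at token t.

    Assumption: the mask is supplied by the caller; in the paper it is
    predicted by a small Transformer decoder.
    """
    bands = decompose(x, kernels)
    return [
        x[t] + sum(mask[m][t] * bands[m][t] for m in range(len(bands)))
        for t in range(len(x))
    ]


# One activation dimension observed across four tokens (toy values).
x = [1.0, -2.0, 0.5, 3.0]
kernels = [[1.0, 0.0], [0.5, 0.5]]  # two hypothetical length-2 kernels
gains = [0.1, -0.2]                 # amplify band 0, suppress band 1
y = filter_signal(x, kernels, gains)
```

With all gains set to zero the signal passes through unchanged, which is the residual behaviour the caption describes. Because each kernel only reads the current and earlier tokens, changing a later token leaves earlier outputs untouched, so the block stays compatible with next-token prediction.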
