Can persistent homology whiten Transformer-based black-box models? A case study on BERT compression

Luis Balderas; Miguel Lastra; José M. Benítez

TL;DR

The paper tackles the challenge of making Transformer-based BERT models both explainable and deployment-friendly by applying zero-dimensional persistent homology to neuron outputs. The authors introduce OBCE (Optimus BERT Compression and Explainability), which quantifies neuron importance via the merge radius $r_f$, read from persistence diagrams, and prunes units using percentile thresholds on $r_f$, yielding compressed models. Experiments on the GLUE benchmark show substantial parameter reductions (to $58.47\%$ of the original parameters for BERT Base and $52.3\%$ for BERT Large) with competitive or improved task performance, surpassing several prior compression methods. The work indicates that topological features can provide principled explainability and practical efficiency for large language models, supporting their use on resource-constrained devices.
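The percentile-based pruning step can be illustrated with a minimal sketch. It assumes one $r_f$ value per unit is already available and that units are kept when their $r_f$ falls inside a percentile band. The function names `select_units` and `prune_dense`, the band limits, and the random data are hypothetical; the summary does not state which percentiles or which side of the distribution OBCE retains.

```python
import numpy as np

def select_units(r_f, lower_pct=20.0, upper_pct=100.0):
    """Return indices of units whose r_f lies inside a percentile band.

    The band [lower_pct, upper_pct] is an illustrative assumption,
    not the setting used by the authors.
    """
    r_f = np.asarray(r_f, dtype=float)
    lo, hi = np.percentile(r_f, [lower_pct, upper_pct])
    return np.flatnonzero((r_f >= lo) & (r_f <= hi))

def prune_dense(weight, bias, keep):
    """Drop output units of a dense layer.

    weight has shape (out_units, in_units); bias has shape (out_units,).
    """
    return weight[keep, :], bias[keep]

# Hypothetical data: 8 units with 4 inputs each.
rng = np.random.default_rng(0)
r_f = rng.random(8)
weight = rng.normal(size=(8, 4))
bias = rng.normal(size=8)

keep = select_units(r_f, lower_pct=50.0)
w_small, b_small = prune_dense(weight, bias, keep)
print(keep, w_small.shape)
```

Removing output units from one dense layer also requires removing the matching input columns of the layer that consumes them, which this sketch leaves out.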

Abstract

Large Language Models (LLMs) like BERT have gained significant prominence due to their remarkable performance in various natural language processing tasks. However, they come with substantial computational and memory costs. Additionally, they are essentially black-box models, challenging to explain and interpret. In this article, we propose Optimus BERT Compression and Explainability (OBCE), a methodology to bring explainability to BERT models using persistent homology, aiming to measure the importance of each neuron by studying the topological characteristics of its outputs. As a result, we can compress BERT significantly by reducing the number of parameters (58.47% of the original parameters for BERT Base, 52.3% for BERT Large). We evaluated our methodology on the standard GLUE Benchmark, comparing the results with state-of-the-art techniques and achieving outstanding results. Consequently, our methodology can "whiten" BERT models by providing explainability to their neurons and reducing the models' size, making them more suitable for deployment on resource-constrained devices.

Paper Structure (20 sections, 9 equations, 7 figures, 5 tables, 1 algorithm)

Introduction
Previous work
Persistent homology applied to machine learning problems
Brief description of the BERT model
BERT model pruning methods
Our proposal
An intuitive geometric description of persistent homology applied to LLM explanations
Explaining LLMs: BERT compression using persistent homology
Dataset selection
Using persistent homology to explain BERT Layer outputs
Evaluation of $r_f$ distribution and selection of the important units
Construction of the compressed model
Evaluation of the compressed model through the GLUE Benchmark
Empirical evaluation
Distribution of $r_f$ and selection of the more informative neurons
...and 5 more sections

Figures (7)

Figure 1: Usage of BERT involves taking a sentence, adding the special tokens [CLS] and [SEP], tokenizing the words, and using these tokens as input for the neural network.
Figure 2: Representation of the BERT architecture. It is composed of an embedding module, followed by the Encoder part, which consists of $N$ BERT Layers (12 for BERT Base, 24 for BERT Large). Within each BERT Layer, three main components stand out: the Attention layer, the Intermediate layer, and the Output layer. After the Encoder, BERT has a Pooler layer.
Figure 3: Flow of information through the BERT model and subsequent extraction of values from intermediate dense layers. The process begins with a set of sentences. These sentences are tokenized, adding special tokens such as [CLS], which marks the beginning of a sentence, and [SEP], which marks the end. The neural network's input must have a fixed number of tokens per instance. Since sentences vary in length, the maximum length is taken as a reference, and shorter sentences are filled up to that length with the [PAD] token (value 0). Once the input of $N$ sentences with length $M$ is constructed, it is fed into the neural network. The lower right part of the image represents the output of any dense layer. The output is a three-dimensional array whose dimensions are the number of sentences in the input ($N$), the length of the tokenized sentences ($M$), and the number of hidden neurons in that layer. Since the [CLS] token (first column of each matrix) usually contains information that encapsulates the semantics of the entire sentence, this methodology uses the values of the [CLS] token for subsequent analysis. For the sake of clarity, this example shows the representation of the [CLS] token in two dimensions.
Figure 4: Application of persistent homology to the output of a neuron at three specific moments. On the left, each point of the output becomes the center of a disk whose radius $r$ grows uniformly for all points. On the right, we represent the Birth-Death diagram for zero-dimensional persistent homology. Each blue point corresponds to the disappearance of a connected component after it merges with another. The last moment depicted corresponds to the smallest value of $r$ at which all connected components have merged into one. We call this value $r_f$, and it is crucial in our methodology because it provides information about the importance of neurons based on their output within the neural network's data flow.
Figure 5: [Diagram: X-axis (Birth Time), Y-axis (Death Time, Persistence)] An example of a Birth-Death persistence diagram. The X-axis shows when connected components are born. Since we use zero-dimensional persistent homology, all connected components are born at time zero. As the value of $r$ increases (Y-axis), the connected components collapse. Each time two components collapse, a point is plotted. The last value below the dashed line corresponds to $r_f$ (circled in red).
...and 2 more figures
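The procedure described in the captions of Figures 3 and 4 can be sketched in a few lines. The sketch assumes that each unit's [CLS] outputs over the $N$ sentences are treated as a point cloud, and that two components merge when disks of radius $r$ around their points touch, that is, at half the distance between them. Other conventions (some libraries report the full distance) differ only by a factor of two. The names `h0_death_radii` and `merge_radius` and the random layer output are hypothetical, and the union-find pass below stands in for whatever persistent homology library the authors used.

```python
import numpy as np

def h0_death_radii(points):
    """Death radii of zero-dimensional homology classes of a point cloud.

    Components are merged in order of increasing pairwise distance
    (Kruskal's algorithm with union-find); each merge kills one class.
    """
    pts = np.asarray(points, dtype=float)
    if pts.ndim == 1:
        pts = pts[:, None]  # treat scalars as points on the real line
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    i_idx, j_idx = np.triu_indices(n, k=1)
    order = np.argsort(dist[i_idx, j_idx])
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    for e in order:
        a, b = find(int(i_idx[e])), find(int(j_idx[e]))
        if a != b:
            parent[a] = b
            # Disks of radius r touch when 2r equals the distance.
            deaths.append(dist[i_idx[e], j_idx[e]] / 2.0)
            if len(deaths) == n - 1:
                break
    return np.array(deaths)

def merge_radius(points):
    """Radius at which all components have merged (r_f in the captions)."""
    deaths = h0_death_radii(points)
    return float(deaths.max()) if deaths.size else 0.0

# Hypothetical dense-layer output: N sentences, M tokens, H hidden units.
rng = np.random.default_rng(0)
layer_out = rng.normal(size=(16, 10, 4))
cls_values = layer_out[:, 0, :]  # [CLS] is the first token: shape (N, H)
r_f = np.array([merge_radius(cls_values[:, h])
                for h in range(cls_values.shape[1])])
print(r_f)
```

For scalar outputs the result reduces to half of the largest gap between consecutive sorted values, so `merge_radius([0.0, 1.0, 4.0])` gives 1.5.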

Theorems & Definitions (7)

Definition 1: affinely independent
Definition 2: $k$-simplex
Definition 3: simplicial complex
Definition 4: Vietoris-Rips (VR) complex
Definition 5: Chain complexes
Definition 6: Homology group
Definition 7: Persistent homology
