MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation

Maike Behrendt, Stefan Sylvius Wagner, Stefan Harmeling

TL;DR

MaxPoolBERT introduces lightweight refinements to BERT that enrich the [CLS] token through depth-wise aggregation across layers and width-wise aggregation across tokens. It studies three variants: Max_CLS, MHA, and the combined Max_Seq+MHA. The combined variant yields the strongest and most consistent improvements on GLUE, particularly in low-resource scenarios, without requiring additional pre-training. The results show an average gain of about $1.25$ points over BERT base, with notable robustness on small datasets and some transferability to RoBERTa. The approach is simple to apply during fine-tuning and adds minimal overhead, suggesting practical benefits for sentence-level classification tasks where data is limited.

Abstract

The [CLS] token in BERT is commonly used as a fixed-length representation for classification tasks, yet prior work has shown that both other tokens and intermediate layers encode valuable contextual information. In this work, we study lightweight extensions to BERT that refine the [CLS] representation by aggregating information across layers and tokens. Specifically, we explore three modifications: (i) max-pooling the [CLS] token across multiple layers, (ii) enabling the [CLS] token to attend over the entire final layer using an additional multi-head attention (MHA) layer, and (iii) combining max-pooling across the full sequence with MHA. Our approach, called MaxPoolBERT, enhances BERT's classification accuracy (especially on low-resource tasks) without requiring new pre-training or significantly increasing model size. Experiments on the GLUE benchmark show that MaxPoolBERT consistently outperforms the standard BERT base model on the low-resource tasks.
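The three modifications named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the per-layer hidden states are already available as an array of shape (num_layers, seq_len, hidden) with [CLS] at position 0, uses random untrained projection matrices in place of the learned MHA layer, and omits details such as biases, dropout, residual connections, and batching. All function names (max_cls, cls_attention, max_seq_mha) are hypothetical.

```python
import numpy as np


def max_cls(hidden_states, k=3):
    """Variant (i): element-wise max of the [CLS] vector over the last k layers.

    hidden_states: (num_layers, seq_len, hidden); [CLS] assumed at position 0.
    """
    return hidden_states[-k:, 0, :].max(axis=0)


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cls_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Variant (ii): [CLS] is the only query; keys and values are all tokens.

    tokens: (seq_len, hidden); each weight matrix: (hidden, hidden).
    """
    seq_len, hidden = tokens.shape
    d = hidden // num_heads
    # Project, then split the hidden dimension into heads.
    q = (tokens[:1] @ w_q).reshape(1, num_heads, d).transpose(1, 0, 2)
    keys = (tokens @ w_k).reshape(seq_len, num_heads, d).transpose(1, 0, 2)
    vals = (tokens @ w_v).reshape(seq_len, num_heads, d).transpose(1, 0, 2)
    scores = q @ keys.transpose(0, 2, 1) / np.sqrt(d)   # (heads, 1, seq_len)
    context = softmax(scores, axis=-1) @ vals            # (heads, 1, d)
    # Merge heads back and apply the output projection.
    context = context.transpose(1, 0, 2).reshape(1, hidden)
    return (context @ w_o)[0]


def max_seq_mha(hidden_states, weights, num_heads, k=3):
    """Variant (iii): max-pool every token over the last k layers, then let
    [CLS] attend over the pooled sequence."""
    pooled = hidden_states[-k:].max(axis=0)               # (seq_len, hidden)
    return cls_attention(pooled, *weights, num_heads=num_heads)


# Toy stand-in for encoder outputs: 12 layers, 5 tokens, hidden size 8.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(12, 5, 8))
weights = [rng.normal(size=(8, 8)) for _ in range(4)]    # w_q, w_k, w_v, w_o

cls_i = max_cls(hidden_states, k=3)
cls_ii = cls_attention(hidden_states[-1], *weights, num_heads=2)
cls_iii = max_seq_mha(hidden_states, weights, num_heads=2, k=3)
print(cls_i.shape, cls_ii.shape, cls_iii.shape)
```

In each case the resulting vector would replace the plain final-layer [CLS] embedding as input to the classification head. With k=1 the pooling step reduces to the final layer, so variant (iii) coincides with variant (ii).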

Paper Structure

This paper contains 25 sections, 9 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: MaxPoolBERT performs best on low-resource datasets. We show that our methods, in particular MaxPoolBERT, provide significant improvements for smaller datasets, indicating that the model learns a better representation during fine-tuning (top-left).
  • Figure 2: Comparison of four BERT architectures for sequence classification. (Top left) Classical BERT architecture for sequence classification. (Top right) Applying max-pooling to the [CLS] token embeddings over the last $k$ layers. (Bottom left) Adding an additional MHA layer before classification. (Bottom right) MaxPoolBERT architecture: after the $N$-th layer ($N = 12$ for BERT base), we apply a sequence-wide max-pooling operation over the last $k$ layers (we used $k=3$). The [CLS] token can then attend to every token after the max-pooling, and the resulting [CLS] token embedding is used for classification.
  • Figure 3: Accuracies for the GLUE benchmark with error bars. We show the standard deviation between three fine-tuning runs with three random seeds. Note that the y-axis is shifted but scaled equally across tasks.