Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification

Jungmin Yun, Mihyeon Kim, Youngbin Kim

TL;DR

This work tackles the high cost of self-attention in transformer models for document classification by combining fuzzy-based token pruning of attention keys/values with a Slot Attention-inspired token combining module. The method progressively prunes uninformative tokens, using uncertainty-aware fuzzy logic to mitigate mispruning, and replaces one layer with a combining module that compresses the sequence further. Results across six datasets show consistent accuracy and F1 gains over BERT. With the layer placement and the number of combination tokens tuned, memory cost is reduced to about 0.61× and a speedup of about 1.64× is achieved. The findings indicate a synergistic effect when pruning and combining are integrated, enabling more efficient, high-performance document classification with transformer architectures.

Abstract

Transformer-based models have achieved dominant performance in numerous NLP tasks. Despite their remarkable successes, pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism that interacts with all tokens, including the ones unfavorable to classification performance. To overcome these challenges, we propose integrating two strategies: token pruning and token combining. Token pruning eliminates less important tokens in the attention mechanism's key and value as they pass through the layers. Additionally, we adopt fuzzy logic to handle uncertainty and alleviate potential mispruning risks arising from an imbalanced distribution of each token's importance. Token combining, on the other hand, condenses input sequences into smaller sizes in order to further compress the model. By integrating these two approaches, we not only improve the model's performance but also reduce its computational demands. Experiments with various datasets demonstrate superior performance compared to baseline models, with the best improvement over the existing BERT model reaching +5%p in accuracy and +5.6%p in F1 score. Additionally, memory cost is reduced to 0.61x, and a speedup of 1.64x is achieved.
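
To make the key/value pruning idea concrete, here is a minimal pure-Python sketch. It rests on two illustrative assumptions that are not taken from the paper: the importance of a token is the average attention it receives, and tokens are kept by a plain top-k rule. The fuzzy membership step mentioned in the abstract is not reproduced here, and the function names (`attention_probs`, `token_importance`, `prune_keys_values`, `attend`) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_probs(queries, keys):
    """Scaled dot-product attention probabilities, one row per query."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def token_importance(probs):
    """Assumed score: mean attention each key token receives over all queries."""
    n_queries = len(probs)
    return [sum(row[j] for row in probs) / n_queries for j in range(len(probs[0]))]


def prune_keys_values(keys, values, importance, keep):
    """Keep the `keep` highest-scoring key/value tokens, preserving their order."""
    ranked = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    kept = sorted(ranked[:keep])
    return [keys[j] for j in kept], [values[j] for j in kept], kept


def attend(queries, keys, values):
    """Attention output: each query mixes the (possibly pruned) value vectors."""
    probs = attention_probs(queries, keys)
    dim = len(values[0])
    return [[sum(p * v[d] for p, v in zip(row, values)) for d in range(dim)] for row in probs]


# Toy input: four token embeddings of dimension 2 (made-up numbers).
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
scores = token_importance(attention_probs(x, x))
pruned_keys, pruned_values, kept = prune_keys_values(x, x, scores, keep=2)
output = attend(x, pruned_keys, pruned_values)  # still one output vector per token
```

Because only the keys and values are pruned, the queries stay at full length: the output still has one vector per input token, while the attention matrix shrinks from n × n to n × keep.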

Paper Structure

This paper contains 14 sections, 9 equations, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: Overall architecture of our proposed model. The model architecture is composed of several Token-pruned Attention Blocks, a Token Combining Module, and Attention Blocks. (Left) Fuzzy-based Token Pruning Self-attention: in each layer, the fuzzy-based pruning method removes tokens using an importance score and a fuzzy membership function. (Right) Token Combining Module: this module apportions embedded tokens to each of the combination tokens using a similarity matrix between them.
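
The apportioning step described for the Token Combining Module can be sketched as a single soft-assignment pass in the spirit of Slot Attention. This is a rough illustration under stated assumptions, not the paper's module: the similarity is assumed to be a scaled dot product, learned projections and iterative refinement are left out, and the name `combine_tokens` is hypothetical.

```python
import math


def combine_tokens(tokens, combo_tokens):
    """Soft-assign each input token to the combination tokens, then average.

    Returns the updated combination tokens and the assignment matrix
    assign[i][j] = share of input token j given to combination token i.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    n_combo, n_tokens = len(combo_tokens), len(tokens)

    # Similarity matrix between combination tokens (rows) and input tokens (columns).
    sim = [
        [sum(c * t for c, t in zip(combo, tok)) / scale for tok in tokens]
        for combo in combo_tokens
    ]

    # Softmax over the combination-token axis: each input token splits
    # one unit of weight across the combination tokens.
    assign = [[0.0] * n_tokens for _ in range(n_combo)]
    for j in range(n_tokens):
        column = [sim[i][j] for i in range(n_combo)]
        m = max(column)
        exps = [math.exp(s - m) for s in column]
        total = sum(exps)
        for i in range(n_combo):
            assign[i][j] = exps[i] / total

    # Each combination token becomes the weighted mean of the input tokens.
    combined = []
    for i in range(n_combo):
        weight_sum = max(sum(assign[i]), 1e-12)  # guard against division by zero
        combined.append(
            [sum(assign[i][j] * tokens[j][d] for j in range(n_tokens)) / weight_sum
             for d in range(dim)]
        )
    return combined, assign


# Toy input: four token embeddings compressed into two combination tokens.
tokens = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
combo_init = [[1.0, 0.0], [0.0, 1.0]]
combined, assign = combine_tokens(tokens, combo_init)
```

The sequence passed to later layers then has as many positions as there are combination tokens (two here instead of four), which is where the additional compression comes from.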