Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models

Sofiane Ennadir, Levente Zólyomi, Oleg Smirnov, Tianze Wang, John Pertoft, Filip Cornell, Lele Cao

TL;DR

This work reveals pooling as a principled driver of Transformer expressivity, introducing a formal framework to quantify how fixed and learnable pooling strategies influence the capacity to distinguish similar inputs. By deriving explicit bounds that depend on pooling type and architectural constants, the authors show that contractive pooling favors global context while expansive pooling enhances local sensitivity, with learnable pooling balancing the two. Empirical validation across vision, language, and time-series domains confirms that pooling performance is task-dependent and demonstrates that adaptive pooling methods can approach or exceed fixed strategies as model size grows. The study thus provides theoretical and practical guidance for selecting or designing pooling mechanisms to match task demands and inductive biases, and points to future work on dynamic pooling and robustness considerations.

Abstract

Transformer models have become the dominant backbone for sequence modeling, leveraging self-attention to produce contextualized token representations. These are typically aggregated into fixed-size vectors via pooling operations for downstream tasks. While much of the literature has focused on attention mechanisms, the role of pooling remains underexplored despite its critical impact on model behavior. In this paper, we introduce a theoretical framework that rigorously characterizes the expressivity of Transformer-based models equipped with widely used pooling methods by deriving closed-form bounds on their representational capacity and the ability to distinguish similar inputs. Our analysis extends to different variations of attention formulations, demonstrating that these bounds hold across diverse architectural variants. We empirically evaluate pooling strategies across tasks requiring both global and local contextual understanding, spanning three major modalities: computer vision, natural language processing, and time-series analysis. Results reveal consistent trends in how pooling choices affect accuracy, sensitivity, and optimization behavior. Our findings unify theoretical and empirical perspectives, providing practical guidance for selecting or designing pooling mechanisms suited to specific tasks. This work positions pooling as a key architectural component in Transformer models and lays the foundation for more principled model design beyond attention alone.
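
To make the aggregation step concrete, the following is a minimal sketch of a few pooling operations that map n token vectors of dimension d to a single d-dimensional vector. It is an illustration only, not the authors' implementation: the function names are hypothetical, the token matrix is a plain list of lists standing in for the output of a Transformer encoder, and the softmax normalisation of the weighted average is an assumption made here.

```python
import math

def mean_pool(tokens):
    """Average each feature over the n tokens."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def max_pool(tokens):
    """Take the per-feature maximum over the n tokens."""
    return [max(col) for col in zip(*tokens)]

def last_token_pool(tokens):
    """Use the representation of the final token only."""
    return list(tokens[-1])

def weighted_avg_pool(tokens, scores):
    """Weighted average with softmax-normalised weights.

    `scores` stands in for learnable parameters (one per position);
    the softmax normalisation is an assumption of this sketch.
    """
    m = max(scores)                            # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * x for w, x in zip(weights, col)) for col in zip(*tokens)]

# Toy input: n = 3 tokens, d = 2 features.
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
print(mean_pool(tokens))        # [2.0, 2.0]
print(max_pool(tokens))         # [3.0, 4.0]
print(last_token_pool(tokens))  # [2.0, 4.0]
print(weighted_avg_pool(tokens, [0.0, 0.0, 5.0]))  # dominated by the last token
```

With equal scores the weighted average reduces to mean pooling, and with one dominant score it approaches single-token pooling, which is one way to read the claim that learnable pooling can sit between fixed strategies.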

Paper Structure

This paper contains 33 sections, 6 theorems, 59 equations, 5 figures, 10 tables.

Key Result

Theorem 4.2

Let $f \colon \mathcal{X} \subseteq \mathbb{R}^{n \times d} \rightarrow \mathcal{Y} \subseteq \mathbb{R}^d$ be a Transformer-based model (TBM) following the framework introduced in the preliminaries section. With respect to Definition 4.1 (expressivity), we have:

Figures (5)

  • Figure 1: Performance of different pooling strategies using a GPT-2 pre-trained model.
  • Figure 2: Empirical analysis of expressive power across modalities and pooling strategies. Left: Mean pooled-output distance $\gamma$ versus perturbation $\epsilon$ across modalities, highlighting the behavior of the various methods. Right: Pooled-output distances for similar and dissimilar inputs, exemplifying the expressivity of different strategies.
  • Figure 3: Left: Cosine similarity between W-Avg pooling and other pooling methods, showing task-dependent alignment. Right: The distribution of the learned weights in W-Avg pooling, illustrating the adaptability of the pooling mechanism.
  • Figure 4: Empirical analysis of expressive power across modalities and pooling strategies. Left: Mean pooled-output distance $\gamma$ versus perturbation $\epsilon$ across modalities, highlighting the behavior of the various methods. Right: Pooled-output distances for similar and dissimilar inputs, exemplifying the expressivity of different strategies.
  • Figure 5: Left: Cosine similarity of weighted average pooling with other pooling methods. Right: Learned weight distributions.
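
The quantity plotted in the left panels of Figures 2 and 4, a mean pooled-output distance $\gamma$ as a function of an input perturbation $\epsilon$, can be approximated with a short experiment. The sketch below is a simplification under stated assumptions: pooling is applied directly to random token vectors rather than to the output of a trained Transformer, perturbations are random directions rescaled to Frobenius norm $\epsilon$, and all function names are hypothetical. It contrasts mean pooling, which shrinks the perturbation by at least a factor of $\sqrt{n}$, with sum pooling, which can amplify it by up to $\sqrt{n}$.

```python
import math
import random

def mean_pool(tokens):
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def sum_pool(tokens):
    return [sum(col) for col in zip(*tokens)]

def l2_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def frobenius_distance(a, b):
    return math.sqrt(sum((x - y) ** 2
                         for ra, rb in zip(a, b) for x, y in zip(ra, rb)))

def perturb(tokens, epsilon, rng):
    """Return a copy of `tokens` shifted by random noise of Frobenius norm epsilon."""
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in tokens]
    norm = math.sqrt(sum(v * v for row in noise for v in row))
    return [[x + epsilon * v / norm for x, v in zip(row, nrow)]
            for row, nrow in zip(tokens, noise)]

def mean_output_distance(pool, tokens, epsilon, trials=200, seed=0):
    """Average pooled-output distance over random perturbations of size epsilon."""
    rng = random.Random(seed)
    base = pool(tokens)
    total = 0.0
    for _ in range(trials):
        total += l2_distance(base, pool(perturb(tokens, epsilon, rng)))
    return total / trials

# Toy input: n = 8 tokens with d = 4 features (random stand-in for encoder outputs).
rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]

for eps in (0.1, 0.5, 1.0):
    g_mean = mean_output_distance(mean_pool, tokens, eps)
    g_sum = mean_output_distance(sum_pool, tokens, eps)
    print(f"eps={eps}: mean pooling gamma={g_mean:.4f}, sum pooling gamma={g_sum:.4f}")
```

Because both pooling functions are linear, $\gamma$ grows proportionally with $\epsilon$ in this toy setting; a full model with attention and nonlinearities in front of the pooling layer would not behave so simply, which is where bounds depending on architectural constants become relevant.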

Theorems & Definitions (10)

  • Definition 4.1
  • Theorem 4.2
  • Lemma 4.3
  • Lemma 4.4
  • Theorem
  • proof
  • Lemma
  • proof
  • Lemma
  • proof