Deep Generalized Max Pooling

Vincent Christlein, Lukas Spranger, Mathias Seuret, Anguelos Nicolaou, Pavel Král, Andreas Maier

TL;DR

This work addresses a bias in global pooling methods, which pool each activation map independently. It introduces Deep Generalized Max Pooling (DGMP), a differentiable layer that balances the contribution of activations across spatial locations by optimizing weights for the local descriptors taken along the depth dimension. DGMP reinterprets Generalized Max Pooling (GMP) as a neural-network layer: it solves a ridge-regression objective to compute the weights and produces a unit-norm global descriptor, with a single learnable parameter λ controlling the pooling. Empirically, DGMP outperforms global average and max pooling on writer identification (ICDAR17-WI) and script type classification (CLAMM'16/CLAMM'17), while remaining lightweight and end-to-end trainable. The approach yields stronger, more robust representations for structured historical documents and may extend to other applications such as word spotting. The code is publicly available.

Abstract

Global pooling layers are an essential part of Convolutional Neural Networks (CNNs). Several state-of-the-art CNNs use them to aggregate the activations of all spatial locations into a fixed-size vector. Global average pooling and global max pooling are commonly used to convert the convolutional features of variable-size images into a fixed-size embedding. However, both pooling types are computed in a spatially independent manner: each activation map is pooled individually, and thus activations of different locations are pooled together. In contrast, we propose Deep Generalized Max Pooling, which balances the contribution of all activations of a spatially coherent region by re-weighting all descriptors so that the impact of frequent and rare ones is equalized. We show that this layer is superior to both average and max pooling on the classification of Latin medieval manuscripts (CLAMM'16, CLAMM'17), as well as on writer identification (Historical-WI'17).
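
The contrast drawn above between per-map pooling and descriptor re-weighting can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the standard ridge-regression form of GMP, in which the pooled vector ξ minimizes ||Φᵀξ − 1||² + λ||ξ||², where the columns of Φ are the local descriptors taken along the depth axis. The function names, the toy activation volume, and the fixed value of λ are illustrative assumptions; in DGMP, λ is learned.

```python
import numpy as np


def global_average_pool(acts):
    """Average each activation map independently. acts has shape (D, H, W)."""
    return acts.reshape(acts.shape[0], -1).mean(axis=1)


def global_max_pool(acts):
    """Take the maximum of each activation map independently."""
    return acts.reshape(acts.shape[0], -1).max(axis=1)


def generalized_max_pool(acts, lam=1e-3):
    """Ridge-regression pooling over local descriptors (assumed GMP form).

    Each spatial location contributes one D-dimensional descriptor (a column
    of phi). We look for xi such that phi.T @ xi is close to 1 for every
    descriptor, regularized by lam, and then L2-normalize the result.
    lam is a fixed constant here (assumption); DGMP learns it.
    """
    depth = acts.shape[0]
    phi = acts.reshape(depth, -1)                # (D, N) with N = H * W
    gram = phi @ phi.T + lam * np.eye(depth)     # (D, D)
    # Closed-form ridge solution: (phi phi^T + lam I) xi = phi 1
    xi = np.linalg.solve(gram, phi.sum(axis=1))
    return xi / np.linalg.norm(xi)


# Toy activation volume: descriptor [1, 0] at 9 locations, [0, 1] at 1 location.
acts = np.zeros((2, 2, 5))
acts[0] = 1.0
acts[0, 1, 4] = 0.0
acts[1, 1, 4] = 1.0

avg = global_average_pool(acts)
gmp = generalized_max_pool(acts)
print("average pooling:", avg)
print("generalized max pooling:", gmp)
```

On this toy input, average pooling is dominated by the frequent descriptor, whereas the ridge-regression solution gives the frequent and the rare descriptor nearly equal weight.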

Paper Structure

This paper contains 16 sections, 11 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of Deep Generalized Max Pooling. The activation volume computed by a convolutional layer serves as input to the DGMP layer. A linear optimization problem with D unknowns is solved, in which each local activation vector along the depth axis of the activation volume contributes one linear equation. The output is a weighted sum of the local activation vectors.
  • Figure 2: Indicative Icdar17-WI samples. The first two samples stem from the same writer (according to the ground truth), while the last two come from two different writers (IDs: 11-3-IMG_MAX_1005484, 11-3-IMG_MAX_1005478, 358-3-IMG_MAX_1031951, 7-3-IMG_MAX_10051).
  • Figure 3: Clamm16 samples of two similar fonts (excerpts of IRHT_P_000012, IRHT_P_000020).
  • Figure 4: Mean validation error per epoch for the different pooling schemes, computed over five runs with different initializations. The brighter area denotes the standard deviation. Note the logarithmic scale of the y-axis, used for improved clarity.