MultiMax: Sparse and Multi-Modal Attention Learning

Yuxuan Zhou; Mario Fritz; Margret Keuper

TL;DR

Comprehensive analysis and evaluation show that MultiMax produces a distribution that suppresses irrelevant entries while preserving multi-modality, with benefits in image classification, language modeling, and machine translation.

Abstract

SoftMax is a ubiquitous ingredient of modern machine learning algorithms. It maps an input vector onto a probability simplex and reweights the input by concentrating the probability mass at large entries. Yet, because it is a smooth approximation to the Argmax function, it distributes a significant amount of probability mass to other, residual entries, leading to poor interpretability and noise. Although sparsity can be achieved by a family of SoftMax variants, they often require an alternative loss function and do not preserve multi-modality. We show that this trade-off between multi-modality and sparsity limits the expressivity of SoftMax as well as its variants. We resolve this tension between objectives by proposing a piece-wise differentiable function, termed MultiMax, which adaptively modulates the output distribution according to the range of the input entries. Through comprehensive analysis and evaluation, we show that MultiMax produces a distribution that suppresses irrelevant entries while preserving multi-modality, with benefits in image classification, language modeling, and machine translation. The code is available at https://github.com/ZhouYuxuanYX/MultiMax.
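The trade-off described in the abstract can be made concrete with a small sketch. The code below compares SoftMax at several temperatures with SparseMax on a three-entry input. The input vector and the temperature values are illustrative assumptions, not values from the paper, and the SparseMax routine is the standard sort-and-threshold projection onto the simplex rather than anything specific to this work.

```python
import math

def softmax(x, temperature=1.0):
    """SoftMax with temperature: exp(x_i / T), normalized to sum to 1."""
    scaled = [v / temperature for v in x]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(x):
    """SparseMax: Euclidean projection of x onto the probability simplex."""
    z = sorted(x, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # Entry j stays in the support while 1 + j * z_(j) > sum of top-j entries.
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(v - tau, 0.0) for v in x]

# Illustrative input (an assumption): one large, one moderate, one negative entry.
v = [2.0, 0.9, -1.0]

for t in (0.1, 1.0, 3.0):
    print(f"softmax T={t}:", [round(p, 4) for p in softmax(v, t)])
print("sparsemax:   ", sparsemax(v))
```

On this input, a low temperature and SparseMax both put essentially all mass on the largest entry and so drop the second-largest one, while a high temperature keeps the second entry but also assigns noticeable mass to the negative entry.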

Paper Structure (39 sections, 6 theorems, 18 equations, 10 figures, 8 tables)

Introduction
Related Work
Background, Metrics, and Analysis
Background
Sparsity and Multi-Modality Trade-off
Quantifying Multi-Modality and Sparsity of Reweighting Functions
Proving the Trade-off
MultiMax
First-order MultiMax
Improved Pareto Efficiency
Generalization
Generalization to other activations
Generalization to higher-order polynomials
Generalization beyond Attention
Computational Efficiency
...and 24 more sections

Key Result

Lemma 3.4

$\mathcal{S}(\boldsymbol{x})$ is monotonically decreasing w.r.t. $\phi(\boldsymbol{x})_l$. (See the corresponding proof in the paper.)

Figures (10)

Figure 1: We evaluate the SoftMax, SparseMax, EntMax, EvSoftMax and MultiMax functions (the latter using the parameters of a hidden-layer MultiMax trained on ImageNet directly) on a series of example input points $\boldsymbol{v} \in \mathbb{R}^{3}$ and project the resulting distributions onto the simplex $\Delta^2$. Informally, the interior of the simplex stands for trimodal distributions, the edges constitute the set of bimodal distributions, and the vertices are unimodal distributions. Notably, the figure highlights the advantage of MultiMax's multi-modality. EntMax, SparseMax and SoftMax with a small temperature (blue line) yield a (quasi) uni-modal distribution, which ignores the second-largest entry. In contrast, SoftMax with higher temperatures (green and orange lines) fails to ignore the negative entry.
Figure 2: Illustration of different reweighting functions in the two-dimensional case. MultiMax clearly weighs entries in the small and large value ranges differently, so it does not suffer from the trade-off between sparsity and multi-modality.
Figure 3: The learned modulator functions $\sigma$ at each layer, compared to the identity mapping of the SoftMax input $\boldsymbol{x}$ (dashed black line). All layers except the first two converge to a form that is consistent with our analysis, i.e., low temperature (steep slope) for small entries and high temperature (flat slope) for large entries.
Figure 4: Patch similarities for each layer and at different epochs. Darker colors denote patch similarities at later training epochs.
Figure 5: Histograms of the attention scores at each layer. MultiMax attention is distributed towards both ends: small scores are pushed closer to zero and more scores lie above 0.1.
...and 5 more figures
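The Figure 3 caption describes a modulator with a steep slope for small entries and a flat slope for large entries, applied to the SoftMax input. A minimal sketch of that idea follows. The piece-wise linear form, the function names `modulator` and `modulated_softmax`, and the parameter values `t_b`, `b`, `t_d`, `d` are all assumptions made for illustration; the summary above does not state the paper's exact definition, and the learned parameters are not reproduced here.

```python
import math

def softmax(x):
    """Plain SoftMax, stabilized by subtracting the max."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def modulator(x, t_b=0.5, b=0.0, t_d=2.0, d=1.0):
    """Hypothetical piece-wise linear modulator (assumed form).

    Slope 1/t_b below b (steep when t_b < 1), slope 1 between b and d,
    and slope 1/t_d above d (flat when t_d > 1).
    """
    return (x
            + (1.0 - 1.0 / t_b) * max(b - x, 0.0)
            + (1.0 / t_d - 1.0) * max(x - d, 0.0))

def modulated_softmax(x, **params):
    """SoftMax applied to the modulated input."""
    return softmax([modulator(v, **params) for v in x])

# Illustrative input (an assumption): one large, one moderate, one negative entry.
v = [2.0, 0.9, -1.0]
print("softmax:          ", [round(p, 4) for p in softmax(v)])
print("modulated softmax:", [round(p, 4) for p in modulated_softmax(v)])
```

With these assumed parameters, the negative entry receives less mass than under plain SoftMax while the second-largest entry receives more, which is the qualitative behavior the captions attribute to MultiMax.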

Theorems & Definitions (16)

Definition 3.1
Definition 3.2
Definition 3.3
Lemma 3.4
Proposition 3.5
Definition 4.1
Proposition 4.2
Proposition 4.3
Lemma 1.1
Lemma 1.2
...and 6 more
