Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling

Hehe Fan, Yi Yang, Mohan Kankanhalli, Fei Wu

TL;DR

This paper addresses the challenge of combining the adaptive element identification of self-attention with the relative structure encoding of convolution. It introduces Translution, an operation that uses a separate weight matrix for each relative offset, and α-Translution, a parameter-efficient variant. Experiments on ViT and GPT architectures show that Translution-based models achieve higher accuracy than standard self-attention on ImageNet and related tasks, and are more robust to changes in object position, as demonstrated on Dynamic MNIST. The approach expands the modeling toolbox for vision and language, though it is computationally demanding, and future work is needed for large-scale deployment and cross-modal extensions.

Abstract

When modeling a given type of data, we consider it to involve two key aspects: 1) identifying relevant elements (e.g., image pixels or textual words) to a central element, as in a convolutional receptive field, or to a query element, as in self-attention, and 2) encoding these elements effectively. Self-attention can adaptively identify these elements but relies on absolute positional embedding for structural representation learning. In contrast, convolution encodes elements in a relative manner, yet its fixed kernel size limits its ability to adaptively select the relevant elements. In this paper, we introduce Translution, an operation that unifies the adaptive identification capability of self-attention and the relative encoding advantage of convolution. However, this integration leads to a substantial increase in the number of parameters, exceeding most currently available computational resources. Therefore, we propose a lightweight variant of Translution, named α-Translution. Experiments on computer vision and natural language processing tasks show that Translution (including α-Translution) achieves superior accuracy compared to self-attention. The code is available at https://github.com/hehefan/Translution.

Paper Structure

This paper contains 25 sections, 26 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Difference between convolution and self-attention in identifying relevant elements (blue patches) for the kernel center or query element (yellow patch). Here, convolution is assumed to operate on image patches. 1) Convolution utilizes a fixed kernel size to define a neighborhood of elements considered relevant, inevitably including some irrelevant regions, particularly near object boundaries or within background areas inside the window. The fixed receptive field in convolution can be interpreted as a special case of attention, where the attention score is set to 1 within the receptive field and 0 outside it. 2) Self-attention adaptively identifies relevant elements by assigning greater attention scores to areas with higher relevance, thereby mitigating the inclusion of noisy or irrelevant information.
  • Figure 2: Difference between convolution and self-attention in encoding relevant elements. Consider the scenario where both are capturing the structure of a circle. 1) Convolution learns separate parameters $\{{\bm{W}}_{\delta_x, \delta_y}\}$ for each offset $(\delta_x, \delta_y)$ from the kernel center, where $\delta_x, \delta_y \in [-1,1]$, allowing it to effectively encode relative local structures. Thus, when the circle appears in a different location, it is still readily recognized due to this relative awareness. 2) Self-attention incorporates absolute position into each token's representation and uses position-independent parameters ${\bm{W}} \in \{{\bm{W}}^q, {\bm{W}}^k, {\bm{W}}^v\}$, shared across all tokens, to compute the query, key and value, respectively. While this shared parameterization facilitates general processing, the inclusion of absolute positional embeddings makes it more challenging to recognize the circle when it is moved to a different location.
  • Figure 3: Comparison of self-attention and Translution. 1) Self-attention employs three shared sets of weights, i.e., ${\bm{W}}^q$, ${\bm{W}}^k$, and ${\bm{W}}^v$, across all patches to compute query, key, and value, respectively. 2) Translution uses separate parameters for each offset (direction and distance), i.e., $\{{\bm{W}}^q_{\delta_x, \delta_y}\}$, $\{{\bm{W}}^k_{\delta_x, \delta_y}\}$ and $\{{\bm{W}}^v_{\delta_x, \delta_y}\}$, to encode relative structures.
  • Figure 4: Examples of static and dynamic MNIST. Static MNIST digits are fixed at the center of images, whereas dynamic MNIST digits are randomly positioned within the images.
  • Figure 5: When modeling text, Translution operates in a 1D setting. For a sequence of length $N$, it employs separate parameters for each positional offset (considering both direction and distance), i.e., $\{{\bm{W}}^q_{-(N-1)}, \cdots, {\bm{W}}^q_{0}, \cdots, {\bm{W}}^q_{N-1}\}$, $\{{\bm{W}}^k_{-(N-1)}, \cdots, {\bm{W}}^k_{0}, \cdots, {\bm{W}}^k_{N-1}\}$ and $\{{\bm{W}}^v_{-(N-1)}, \cdots, {\bm{W}}^v_{0}, \cdots, {\bm{W}}^v_{N-1}\}$, to encode relative language structure.
  • ...and 2 more figures
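
The per-offset parameterization described in the captions of Figures 2, 3 and 5 can be made concrete with a small indexing sketch. The Python below is illustrative only and is not the authors' implementation. The function names are hypothetical, the sign convention (offset = j − i) is an assumption, and the 2D count assumes offsets range over −(H−1)..H−1 and −(W−1)..W−1 by analogy with the 1D range given in Figure 5.

```python
# Illustrative sketch: only the 1D offset range -(N-1)..N-1 comes from the
# Figure 5 caption. All names and the sign convention are assumptions.

def offsets_1d(n):
    """All relative offsets for a sequence of length n: -(n-1), ..., 0, ..., n-1."""
    return list(range(-(n - 1), n))


def offset_index_1d(i, j, n):
    """Map a (query position i, key position j) pair to the slot of its
    per-offset weight matrix. Assumed convention: offset = j - i."""
    return (j - i) + (n - 1)


def num_offset_matrices_1d(n):
    """Number of distinct weight matrices per projection (q, k or v) in 1D."""
    return 2 * n - 1


def num_offset_matrices_2d(h, w):
    """Assumed 2D analogue for an h x w grid of patches."""
    return (2 * h - 1) * (2 * w - 1)


n = 4
# table[i][j] is the index of the weight matrix used for the pair (i, j).
table = [[offset_index_1d(i, j, n) for j in range(n)] for i in range(n)]

# Self-attention uses one shared matrix per projection (W^q, W^k, W^v).
self_attention_matrices = 3
# Per-offset weights need one matrix per offset for each of the 3 projections.
translution_matrices_1d = 3 * num_offset_matrices_1d(n)
# Hypothetical 14 x 14 patch grid, chosen only for illustration.
translution_matrices_2d = 3 * num_offset_matrices_2d(14, 14)
```

Entries along each diagonal of `table` are equal, which is the sense in which pairs at the same relative offset share weights wherever they occur in the sequence. Under these assumptions a length-4 sequence needs 7 matrices per projection where self-attention uses 1, and a 14×14 patch grid would need 729, which gives a feel for the parameter growth mentioned in the abstract.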