Table of Contents
Fetching ...

TreeCoders: Trees of Transformers

Pierre Colonna D'Istria, Abdulrahman Altahhan

TL;DR

TreeCoders, a novel family of transformer trees moved away from traditional linear transformers to complete k-ary trees, demonstrates that the proposed tree transformer model outperforms a size-equivalent linear transformer model 76% of the time over a wide range of tree architectures.

Abstract

In this paper, we introduce TreeCoders, a novel family of transformer trees. We moved away from traditional linear transformers to complete k-ary trees. Transformer blocks serve as nodes, and generic classifiers learn to select the best child and route the sequence of tokens to a specific leaf. The selectors, moved outside the transformer blocks, allow for the use of a variety of architecture without further modifications. Furthermore, our proposed architecture supports sparse node activation due to the logarithmic complexity of a tree search. We validate our idea by testing a series of decoder-only tree transformers, achieving competitive results across a diverse range of language datasets. Our study demonstrates that the proposed tree transformer model outperforms a size-equivalent linear transformer model 76\% of the time over a wide range of tree architectures. Furthermore, our proposed model naturally lends itself to distributed implementation.

TreeCoders: Trees of Transformers

TL;DR

TreeCoders, a novel family of transformer trees moved away from traditional linear transformers to complete k-ary trees, demonstrates that the proposed tree transformer model outperforms a size-equivalent linear transformer model 76% of the time over a wide range of tree architectures.

Abstract

In this paper, we introduce TreeCoders, a novel family of transformer trees. We moved away from traditional linear transformers to complete k-ary trees. Transformer blocks serve as nodes, and generic classifiers learn to select the best child and route the sequence of tokens to a specific leaf. The selectors, moved outside the transformer blocks, allow for the use of a variety of architecture without further modifications. Furthermore, our proposed architecture supports sparse node activation due to the logarithmic complexity of a tree search. We validate our idea by testing a series of decoder-only tree transformers, achieving competitive results across a diverse range of language datasets. Our study demonstrates that the proposed tree transformer model outperforms a size-equivalent linear transformer model 76\% of the time over a wide range of tree architectures. Furthermore, our proposed model naturally lends itself to distributed implementation.

Paper Structure

This paper contains 18 sections, 1 equation, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: A tree structure allows for a sparse activation of the network. The sparsity will also grow with the tree
  • Figure 2: Example of two different tree architectures where the tokens go through the same number of decoder layers. On the left is a tree (h, dec) = (1, 3) of height 1, with 3 decoder layers per node. On the right is a tree of height 2 with 2 decoder layers per node. In both cases, the path length is 6.
  • Figure 3: Example of a binary tree of height 2 and N decoder layer per node.
  • Figure 4: Some of the different topologies possible with our method
  • Figure 5: Path of a sequence of tokens through a transformer node, a selector, and then sent to the selected child node.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Definition 1: Node
  • Definition 2: Height
  • Definition 3: Number of layers
  • Definition 4: Token Path Length