Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation

Tim Elsner, Paula Usinger, Julius Nehring-Wirxel, Gregor Kobsik, Victor Czech, Yanjiang He, Isaak Lim, Leif Kobbelt

TL;DR

This work improves tokenisation of visual data by bringing Byte Pair Encoding from 1D to multiple dimensions, as a complementary add-on to existing compression, and introduces a strategy to amplify this compression further by clustering the vocabulary.

Abstract

In language processing, transformers benefit greatly from text being condensed. This is achieved through a larger vocabulary that captures word fragments instead of plain characters, often built with Byte Pair Encoding. For images, tokenisation of visual data is usually limited to regular grids obtained from quantisation methods, without global content awareness. Our work improves tokenisation of visual data by bringing Byte Pair Encoding from 1D to multiple dimensions, as a complementary add-on to existing compression. We achieve this by counting constellations of token pairs and replacing the most frequent token pair with a newly introduced token. The multidimensionality only increases the computation time by a factor of 2 for images, making it applicable even to large datasets like ImageNet within minutes on consumer hardware. This is a lossless preprocessing step. Our evaluation shows improved training and inference performance of transformers on visual data, achieved by compressing frequent constellations of tokens: the resulting sequences are shorter, with more uniformly distributed information content, e.g. condensing empty regions in an image into single tokens. As our experiments show, these condensed sequences are easier to process. We additionally introduce a strategy to amplify this compression further by clustering the vocabulary.
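
The core step described in the abstract, counting constellations of token pairs and replacing the most frequent one with a new token, can be illustrated with a minimal sketch of a single merge on a 2D token grid. This is not the authors' implementation: the names `OFFSETS`, `ABSORBED`, `count_pairs`, `merge_most_frequent`, `unmerge` and `sequence_length` are hypothetical, the sentinel for absorbed cells is an assumption made to keep the grid rectangular, and repeated merges (which would require tracking the shapes of already merged tokens) are left out.

```python
from collections import Counter

# One pair offset per grid dimension: right neighbour and lower neighbour.
# (Assumption: only directly adjacent unit tokens are paired in this sketch.)
OFFSETS = ((0, 1), (1, 0))
ABSORBED = -1  # hypothetical sentinel for a cell swallowed by a merged token


def count_pairs(grid):
    """Count (token_a, token_b, offset) constellations over both axes."""
    counts = Counter()
    h, w = len(grid), len(grid[0])
    for dy, dx in OFFSETS:
        for y in range(h - dy):
            for x in range(w - dx):
                counts[(grid[y][x], grid[y + dy][x + dx], (dy, dx))] += 1
    return counts


def merge_most_frequent(grid, new_token):
    """Replace the most frequent constellation with new_token (one merge step)."""
    (ta, tb, (dy, dx)), _ = count_pairs(grid).most_common(1)[0]
    out = [row[:] for row in grid]
    h, w = len(grid), len(grid[0])
    for y in range(h - dy):
        for x in range(w - dx):
            # Scanning the partially rewritten grid keeps occurrences from overlapping.
            if out[y][x] == ta and out[y + dy][x + dx] == tb:
                out[y][x] = new_token
                out[y + dy][x + dx] = ABSORBED
    return out, (ta, tb, (dy, dx))


def unmerge(grid, new_token, rule):
    """Invert one merge step, showing that the step loses no information."""
    ta, tb, (dy, dx) = rule
    out = [row[:] for row in grid]
    for y, row in enumerate(grid):
        for x, tok in enumerate(row):
            if tok == new_token:
                out[y][x] = ta
                out[y + dy][x + dx] = tb
    return out


def sequence_length(grid):
    """Number of tokens left once absorbed cells are skipped."""
    return sum(tok != ABSORBED for row in grid for tok in row)


grid = [
    [0, 0, 1, 2],
    [0, 0, 1, 2],
    [3, 3, 1, 2],
    [3, 3, 1, 2],
]
merged, rule = merge_most_frequent(grid, new_token=4)
print(rule, sequence_length(grid), sequence_length(merged))
print(unmerge(merged, 4, rule) == grid)
```

In this toy grid the horizontal pair of tokens 1 and 2 occurs four times, more often than any other constellation, so one merge shortens the sequence from 16 to 12 tokens and `unmerge` restores the original grid exactly. Note that the naive count includes overlapping occurrences (for example three pairs in a run of four equal tokens), which a more careful implementation may want to handle differently.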

Paper Structure

This paper contains 27 sections, 1 equation, 22 figures, 2 tables, 1 algorithm.

Figures (22)

  • Figure 1: Our algorithm compresses visual data in order to make tasks like generation more efficient: Shorter sequences, even if they are from a larger vocabulary, are easier to handle for deep learning architectures like transformers. The images show representative examples after the same training time, with training on shortened sequences (right) producing better results faster.
  • Figure 2: Token classes and unique token ID. The resulting sequence would be AGACBDFEGF (compressed from length $16$ to length $10$).
  • Figure 3: We replace token constellations occurring frequently by sliding a pairwise mask over each dimension.
  • Figure 4: For predicting the next token at a position (red cross), we can reroll tokens that exceed the boundary (orange) or overlap with existing tokens (blue).
  • Figure 5: In addition to giving a positional encoding of the token position itself, we also give the network the position of the next token (implicitly defined through previous token shapes) and an integrated positional encoding (i.e. sum over $Pe$) describing the area of the token shape.
  • ...and 17 more figures