Token Turing Machines are Efficient Vision Models

Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou, George K. Thiruvathukal, James C. Davis, Yung-Hsiang Lu

TL;DR

This paper presents Vision Token Turing Machines (ViTTM), a memory-augmented ViT that maintains two token streams (process and memory) and exchanges information between them through per-layer read and write operations, reducing inference latency while preserving accuracy on image classification and semantic segmentation. By keeping fewer process tokens than memory tokens, ViTTM achieves substantial efficiency gains with competitive or improved accuracy relative to ViT baselines: ViTTM-B attains 82.9% Top-1 accuracy on ImageNet-1K at 234.1 ms latency (ViT-B: 81.0% at 529.5 ms) and 45.17 mIoU at 26.8 FPS on ADE20K (ViT-B: 45.65 mIoU at 13.8 FPS). Key design choices include linear attention for reads and writes, Add fusion, and leaving the memory stream unprocessed; ablations show that memory tokens improve accuracy while process tokens drive most performance gains. Overall, ViTTM expands the Pareto frontier of accuracy versus latency for vision transformers and demonstrates the practicality of memory-augmented architectures for non-sequential vision tasks.

Abstract

We propose Vision Token Turing Machines (ViTTM), an efficient, low-latency, memory-augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines and Token Turing Machines, which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non-sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens. Process tokens pass through the encoder blocks and read from and write to the memory tokens at each encoder block in the network, allowing them to store and retrieve information from memory. By ensuring that there are fewer process tokens than memory tokens, we reduce the inference time of the network while maintaining its accuracy. On ImageNet-1K, the state-of-the-art ViT-B has a median latency of 529.5 ms and 81.0% accuracy, while our ViTTM-B is 56% faster (234.1 ms), uses 2.4 times fewer FLOPs, and reaches an accuracy of 82.9%. On ADE20K semantic segmentation, ViT-B achieves 45.65 mIoU at 13.8 frames per second (FPS), whereas our ViTTM-B model achieves 45.17 mIoU at 26.8 FPS (+94%).
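
The relative figures quoted in the abstract can be checked from the absolute numbers it gives. The short script below is only an arithmetic sanity check; reading "56% faster" as a reduction in median latency is an interpretation made here, not something the abstract states explicitly, and all variable names are illustrative.

```python
# Absolute figures quoted in the abstract.
vit_latency_ms, vittm_latency_ms = 529.5, 234.1   # ImageNet-1K median latency
vit_fps, vittm_fps = 13.8, 26.8                   # ADE20K throughput
vit_top1, vittm_top1 = 81.0, 82.9                 # ImageNet-1K accuracy (%)
vit_miou, vittm_miou = 45.65, 45.17               # ADE20K mIoU


def percent_change(baseline, new):
    """Signed change of `new` relative to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Assumption: "56% faster" is read as the relative drop in latency.
latency_reduction_pct = -percent_change(vit_latency_ms, vittm_latency_ms)
speedup_factor = vit_latency_ms / vittm_latency_ms
fps_gain_pct = percent_change(vit_fps, vittm_fps)
top1_delta = vittm_top1 - vit_top1
miou_delta = vittm_miou - vit_miou

print(f"latency reduction: {latency_reduction_pct:.1f}%")
print(f"speedup factor:    {speedup_factor:.2f}x")
print(f"FPS gain:          {fps_gain_pct:.1f}%")
print(f"Top-1 delta:       {top1_delta:+.1f} points")
print(f"mIoU delta:        {miou_delta:+.2f}")
```

Under this reading, the latency drop rounds to the quoted 56% and the FPS gain rounds to the quoted +94%. The same numbers show that the accuracy trade-off differs by task: Top-1 accuracy rises on ImageNet-1K, while mIoU falls slightly on ADE20K.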

Paper Structure (27 sections, 4 equations, 4 figures, 8 tables)

This paper contains 27 sections, 4 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Comparison of our architecture with state-of-the-art methods. ViTTM-B$_{(64,64)}$ has lower latency than models of comparable accuracy (e.g., Lookup-ViT$_{7\times7}$) and higher accuracy than models of comparable latency (e.g., Lookup-ViT$_{3\times3}$).
  • Figure 2: Comparison of NTM/TTMs, ViTs, and ViTTMs. (a) NTMs are sequential models that process an input sequence of length $T$: an input $x_t$ is processed at each time step $t$, and the memory $M$ is read from and written to at each time step. (b) ViTs process a single input $f_0$ ($=x_0$) through a series of $T$ layers indexed by $t$; the output features of layer $t$ are denoted $f_t$. Our ViTTMs are a synthesis of the NTM and ViT architectures. ViTTMs integrate memory into the ViT architecture on a per-layer basis, processing a sequence of features $f_t$ rather than input sequences.
  • Figure 3: ViTTM Architecture. The ViTTM architecture is an NTM-ViT hybrid. In particular, ViTTM creates two views (or streams) of an input image, $x$, using two patch embedding layers. The memory stream, $M$, is created by a memory embedding layer, whereas the process stream, $P$, is created by a process embedding layer. We choose the memory stream to contain more tokens than the process stream, i.e., $T > K$. The process and memory streams exchange information using read and write layers, followed by a fusion operation.
  • Figure 4: Illustration of our fusion implementations. (a) Erase (b) Add (c) Add-Erase. $\alpha$ is computed according to Eq. \ref{eq:alpha}. The inputs to the fusion operation vary with its location in the architecture (see Figure 3).
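
The two-stream design described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The read-process-write ordering within a layer, the `elu(x) + 1` feature map used for linear attention, the absence of learned query/key/value projections, the toy stand-in for a ViT encoder block, and the token counts are all choices made here for illustration. Only three properties come from the text above: a smaller process stream exchanges information with a larger memory stream, reads and writes use linear attention with Add fusion, and the memory stream is not passed through encoder blocks.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: a positive feature map often used for kernelized linear
    # attention. Its use here is an assumption, not taken from the paper.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_cross_attention(queries, keys, values):
    # Linear attention: cost grows linearly with the number of keys because
    # the (dim x dim) key-value summary is computed once and reused.
    q = feature_map(queries)             # (num_queries, dim)
    k = feature_map(keys)                # (num_keys, dim)
    kv_summary = k.T @ values            # (dim, dim)
    normalizer = q @ k.sum(axis=0)       # (num_queries,), strictly positive
    return (q @ kv_summary) / normalizer[:, None]


def toy_encoder_block(tokens, weight):
    # Hypothetical stand-in for a full ViT encoder block (attention + MLP).
    return tokens + np.tanh(tokens @ weight)


def vittm_layer(process, memory, weight):
    # Read: process tokens query the memory; Add fusion into the process stream.
    process = process + linear_cross_attention(process, memory, memory)
    # Only the (smaller) process stream goes through the encoder block.
    process = toy_encoder_block(process, weight)
    # Write: memory tokens query the process stream; Add fusion into memory.
    memory = memory + linear_cross_attention(memory, process, process)
    return process, memory


rng = np.random.default_rng(0)
dim, num_process, num_memory, num_layers = 8, 16, 64, 3   # arbitrary sizes
process = rng.normal(size=(num_process, dim))   # stands in for process embedding
memory = rng.normal(size=(num_memory, dim))     # stands in for memory embedding
weights = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(num_layers)]

for weight in weights:
    process, memory = vittm_layer(process, memory, weight)

print(process.shape, memory.shape)
```

The encoder block, which dominates the cost of a ViT layer, only ever sees the process tokens. The larger memory stream is touched only by the comparatively cheap read and write steps, which is the mechanism the abstract credits for the latency reduction.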