ConcatPlexer: Additional Dim1 Batching for Faster ViTs

Donghoon Han, Seunghyeon Seo, Donghyeon Jeon, Jiho Jang, Chaerin Kong, Nojun Kwak

TL;DR

This work tackles the high computational cost of vision transformers by transferring DataMUX ideas from NLP to vision, introducing ConcatPlexer for multiplexed image classification. It combines a Transformer-based Patchifier with a Conv-based C-Multiplexer to fuse $N_{MUX}$ images into a single forward pass, followed by a shared backbone and a demultiplexing path. The approach yields up to 23.5% fewer GFLOPs than ViT-B/16 while achieving competitive validation accuracy on ImageNet1K and CIFAR100, and demonstrates a clear throughput-accuracy trade-off as the multiplexing factor grows. Overall, the method shows that data multiplexing is a viable route to accelerate vision transformers and hints at extensions to multimodal multiplexing.

Abstract

Transformers have demonstrated tremendous success not only in natural language processing (NLP) but also in computer vision, igniting various creative approaches and applications. Yet the superior performance and modeling flexibility of transformers come with a severe increase in computation costs, and hence several works have proposed methods to reduce this burden. Inspired by a cost-cutting method originally proposed for language models, Data Multiplexing (DataMUX), we propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenation) and greatly improves throughput with little compromise in accuracy. We first introduce a naive adaptation of DataMUX for vision models, Image Multiplexer, and devise novel components to overcome its weaknesses, rendering our final model, ConcatPlexer, at the sweet spot between inference speed and accuracy. ConcatPlexer was trained on the ImageNet1K and CIFAR100 datasets and achieved 23.5% fewer GFLOPs than ViT-B/16 with 69.5% and 83.4% validation accuracy, respectively.
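The token bookkeeping behind dim1 batching can be illustrated with a little arithmetic. The sketch below is not taken from the paper: the function name `multiplexed_sequence_length` and the `reduction` factor are hypothetical, and the class token is ignored. It only shows how many tokens one multiplexed sequence would carry if each image's token sequence were shortened by some factor before concatenation.

```python
def multiplexed_sequence_length(tokens_per_image: int, n_mux: int, reduction: int) -> int:
    """Token count of one multiplexed sequence.

    Assumption (hypothetical): each image's token sequence is shortened by
    `reduction`, then the n_mux shortened sequences are concatenated along dim 1.
    """
    if tokens_per_image % reduction != 0:
        raise ValueError("reduction must divide tokens_per_image")
    return n_mux * (tokens_per_image // reduction)


# A 224x224 input with 16x16 patches gives 14 * 14 = 196 patch tokens.
tokens = (224 // 16) ** 2

# The (n_mux, reduction) pairs are illustrative, not settings from the paper.
for n_mux, reduction in [(1, 1), (2, 2), (4, 4), (4, 2)]:
    length = multiplexed_sequence_length(tokens, n_mux, reduction)
    print(f"N_MUX={n_mux}, reduction={reduction}: {length} tokens carry {n_mux} image(s)")
```

Under these assumptions, when the reduction factor equals $N_{MUX}$ the backbone sees the same sequence length as for a single image while processing $N_{MUX}$ images per forward pass; a smaller reduction keeps more tokens per image at the price of a longer sequence.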

Paper Structure (15 sections, 5 equations, 2 figures, 6 tables)


Figures (2)

  • Figure 1: Overall architecture of (a) Image Multiplexer and (b) ConcatPlexer. The Image Multiplexer multiplexes $N_{MUX}$ images using an MLP and fixed orthogonal matrices. The ConcatPlexer uses a conv layer to reduce the length of each image's token sequence and then concatenates the shortened sequences. $N_{MUX}$ is abbreviated as N in this figure.
  • Figure 2: The architecture of (a) Multiplexer and (b) C-Multiplexer. Both take $N_{MUX}$ inputs and combine them into a single input. $N_{MUX}$ is abbreviated as N in this figure.
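The two multiplexing styles named in the captions can be contrasted with a minimal pure-Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the helper names (`rotation`, `image_multiplex`, `concat_multiplex`) are hypothetical, 2x2 rotation matrices stand in for the fixed orthogonal matrices, the MLP is omitted, and the conv layer is simplified to a strided 1D filter whose scalar weights are shared across channels. Demultiplexing is not shown.

```python
import math


def rotation(theta):
    """2x2 rotation matrix, used here as a simple fixed orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def image_multiplex(images, transforms):
    """Multiplexer-style mixing: transform each image's tokens with its own
    fixed orthogonal matrix, then average token-wise across images.

    images: N token sequences of equal length L and dim D.
    The output keeps length L regardless of N.
    """
    n = len(images)
    mixed = []
    for t in range(len(images[0])):
        rotated = [matvec(transforms[i], images[i][t]) for i in range(n)]
        mixed.append([sum(col) / n for col in zip(*rotated)])
    return mixed


def concat_multiplex(images, kernel):
    """C-Multiplexer-style mixing: shorten each sequence with a strided 1D
    filter (stride = kernel size), then concatenate along the token axis.

    kernel: K scalar weights shared across channels (a simplification;
    a real conv layer would also mix channels).
    """
    k = len(kernel)
    out = []
    for seq in images:
        for start in range(0, len(seq) - k + 1, k):
            window = seq[start:start + k]
            out.append([sum(w * tok[d] for w, tok in zip(kernel, window))
                        for d in range(len(seq[0]))])
    return out


# Two toy "images", each with 4 tokens of dimension 2.
images = [
    [[1.0, 0.0] for _ in range(4)],
    [[2.0, 0.0] for _ in range(4)],
]
mixed = image_multiplex(images, [rotation(0.0), rotation(math.pi / 2)])
concatenated = concat_multiplex(images, kernel=[0.5, 0.5])
print(len(mixed), len(concatenated))
```

In this toy setup both outputs contain 4 tokens for 2 images. The averaging variant superimposes the images in every token, whereas the concatenating variant keeps each image's (shortened) tokens in separate positions of the sequence.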