Bytes Are All You Need: Transformers Operating Directly On File Bytes

Maxwell Horton; Sachin Mehta; Ali Farhadi; Mohammad Rastegari

TL;DR

ByteFormer demonstrates that a Transformer can operate directly on file bytes to perform modality-agnostic inference, eliminating the need to decode files into modality-specific representations at test time. It handles long byte sequences efficiently by combining a 1D byte embedding, strided convolution for downsampling, and shifted window attention with hierarchical downsampling. The method achieves competitive ImageNet results across multiple encodings and strong Speech Commands V2 performance without modality-specific tuning, and a single model can jointly classify images and audio. These findings suggest practical potential for cross-domain, byte-level representation learning and open avenues for encoding-aware analyses and privacy-preserving inference.

Abstract

Modern deep learning approaches usually rely on modality-specific processing. For example, the most common deep learning approach to image classification involves decoding image file bytes into an RGB tensor, which is then passed into a neural network. Instead, we investigate modality-independent representation learning by performing classification directly on file bytes, without the need for decoding files at inference time. This enables models to operate on various modalities without any hand-designed, modality-specific processing. Our model, ByteFormer, improves ImageNet Top-1 classification accuracy by about 5 percentage points (from $72.2\%$ to $77.33\%$) relative to DeIT models of similar size. Compared to Perceiver IO, our model requires absolutely no modality-specific processing at inference time, and uses an order of magnitude fewer parameters at equivalent accuracy on ImageNet. We demonstrate that the same ByteFormer architecture can perform audio classification without modifications or modality-specific preprocessing. We achieve $95.42\%$ classification accuracy on the Speech Commands V2 dataset (comparable to the state-of-the-art accuracy of $98.7\%$). Additionally, we demonstrate that ByteFormer can operate jointly on images and audio, handling joint classification without explicit knowledge of the input modality. We release our code at https://github.com/apple/corenet/tree/main/projects/byteformer.
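To make the idea of operating directly on file bytes concrete, the following is a minimal sketch of a byte-level front end: each byte becomes an integer token, a learned table maps it to a vector, and a strided 1D convolution shortens the sequence before it would reach the Transformer. This is an illustration under stated assumptions, not the released implementation: the embedding width `d`, the kernel size `k`, the stride of `k // 2`, the random weights, and all function names are hypothetical choices made for the example.

```python
import random


def bytes_to_tokens(data):
    """Each byte becomes one integer token in [0, 255]; the file is never decoded."""
    return list(data)


def embed(tokens, table):
    """Look up one d-dimensional vector per byte value."""
    return [table[t] for t in tokens]


def strided_conv1d(x, weight, stride):
    """Downsample a sequence of vectors with a 1D convolution (no padding).

    x:      list of n vectors, each a list of d_in floats
    weight: weight[o][j][i] for output channel o, kernel offset j, input channel i
    """
    k = len(weight[0])
    out = []
    for start in range(0, len(x) - k + 1, stride):
        window = x[start:start + k]
        out.append([
            sum(w_o[j][i] * window[j][i]
                for j in range(k)
                for i in range(len(window[j])))
            for w_o in weight
        ])
    return out


rng = random.Random(0)
d, k = 4, 8  # hypothetical embedding width and kernel size

# Randomly initialised parameters stand in for learned ones.
table = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(256)]
weight = [[[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(k)]
          for _ in range(d)]

data = bytes(range(64))  # stand-in for the raw contents of any file
tokens = bytes_to_tokens(data)
features = strided_conv1d(embed(tokens, table), weight, stride=k // 2)
```

In this sketch the same code path applies whether `data` came from an image file or an audio file, which is the sense in which the input handling is modality-agnostic. The downsampled `features` sequence is what a stack of attention layers would consume.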

Paper Structure (34 sections, 6 figures, 5 tables)

This paper contains 34 sections, 6 figures, 5 tables.

Introduction
Related Work
Method
ByteFormer
Implementation Details
Experimental Setup
Image File Encodings
Audio File Encodings
Preprocessing
Evaluating ByteFormer
Evaluating ByteFormer on ImageNet
Dataset and training details.
Effect of image file encodings.
Effect of $k$.
Comparison with existing multimodal methods.
...and 19 more sections

Figures (6)

Figure 1: ByteFormer vs. ViT. (a) A standard vision Transformer (ViT) decodes file bytes into an RGB image. The image is then split into patches, and patch embeddings are extracted and fed to a Transformer to obtain contextualized patch embeddings, which are classified using a linear classifier. (b) ByteFormer operates directly on file bytes.
Figure 2: (a-c): Illustration of the types of attention used in ablations. Bag attention is computed in two stages. First, individual bags compute attention. Then, attention is computed across bags. (d): ImageNet Top-1 accuracy of BF-Ti with different types of attention. We run out of memory with full attention.
Figure 3: Absolute cosine similarity $|x \cdot y| / (\|x\| \cdot \|y\|)$ for pairs $x, y$ of token embeddings (top row) and positional embeddings (bottom row) learned by BF-Ti. We show results for various file encodings on ImageNet (IN) and Speech Commands V2 (SC).
Figure 4: An overview of our byte remapping method and our masking camera method. (a): In our byte remapping method, we remap byte values using a permutation function before passing inputs to our model. (b): In our masking camera method, our model inputs are heavily masked images rasterized into a continuous array.
Figure 5: (a): A sample image from the ImageNet validation set, with uniform noise applied (top row), and with byte remapping $\phi$ additionally applied (bottom row). (b): ImageNet Top-1 results for obfuscation with $\phi$. We show results with no noise, and with uniform noise in $[-a, a]$ added. We use the fHWC encoding.
...and 1 more figures
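
The byte remapping described in the Figure 4(a) caption can be sketched as a fixed permutation of the 256 possible byte values applied before the input reaches the model. The snippet below is a minimal illustration, assuming a seeded pseudo-random permutation; the helper names `make_permutation` and `invert`, the seed, and the sample input are hypothetical and not taken from the paper's code.

```python
import random


def make_permutation(seed):
    """Build a translation table: a shuffled copy of the byte values 0..255."""
    perm = list(range(256))
    random.Random(seed).shuffle(perm)
    return bytes(perm)


def invert(table):
    """Return the inverse table, so that invert(phi) undoes phi."""
    inverse = [0] * 256
    for src, dst in enumerate(table):
        inverse[dst] = src
    return bytes(inverse)


phi = make_permutation(seed=0)            # hypothetical remapping function
raw = b"example file bytes"               # stand-in for real file contents
obfuscated = raw.translate(phi)           # what the model would receive
restored = obfuscated.translate(invert(phi))
```

Because the permutation is a bijection on byte values, the remapped sequence has the same length as the original and can be inverted only by a party that knows the table.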
