Bytes Are All You Need: Transformers Operating Directly On File Bytes

Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari

TL;DR

ByteFormer demonstrates that a Transformer can operate directly on file bytes to perform modality-agnostic inference, eliminating the need for decoding into modality-specific representations at test time. By incorporating a 1D byte embedding, strided convolution for downsampling, and shifted window attention with hierarchical down-sampling, it handles long byte sequences efficiently. The method achieves competitive ImageNet results across multiple encodings and strong Speech Commands V2 performance without modality-specific tuning, and it can jointly classify images and audio with a single model. These findings suggest practical potential for cross-domain, byte-level representation learning and raise avenues for encoding-aware analyses and privacy-preserving inference.

Abstract

Modern deep learning approaches usually utilize modality-specific processing. For example, the most common deep learning approach to image classification involves decoding image file bytes into an RGB tensor which is passed into a neural network. Instead, we investigate modality-independent representation learning by performing classification directly on file bytes, without the need for decoding files at inference time. This enables models to operate on various modalities without any hand-designed, modality-specific processing. Our model, ByteFormer, improves ImageNet Top-1 classification accuracy by $5\%$ (from $72.2\%$ to $77.33\%$) relative to DeIT models of similar size. Compared to Perceiver IO, our model requires absolutely no modality-specific processing at inference time, and uses an order of magnitude fewer parameters at equivalent accuracy on ImageNet. We demonstrate that the same ByteFormer architecture can perform audio classification without modifications or modality-specific preprocessing. We achieve $95.42\%$ classification accuracy on the Speech Commands V2 dataset (comparable to the state-of-the-art accuracy of $98.7\%$). Additionally, we demonstrate that ByteFormer can operate jointly on images and audio, handling joint classification without explicit knowledge of the input modality. We release our code at https://github.com/apple/corenet/tree/main/projects/byteformer.
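The front end summarized above (a byte embedding followed by strided convolution for downsampling) can be illustrated with a minimal, standard-library sketch. It is not the released ByteFormer code: the function names, the embedding width, the kernel weights, and the stride are hypothetical values chosen for illustration, and the convolution applies one scalar weight per tap to every channel, which is simpler than a learned convolution.

```python
import random

def bytes_to_tokens(data):
    """Treat every raw file byte as a token ID in [0, 255]; no format decoding."""
    return list(data)

def embed(tokens, table):
    """Look up one embedding vector per byte token."""
    return [table[t] for t in tokens]

def strided_conv1d(seq, kernel, stride):
    """Shorten a sequence of vectors with an unpadded strided 1D convolution.

    Simplification (assumption): each kernel tap is a single scalar shared
    across all channels.
    """
    k = len(kernel)
    dim = len(seq[0])
    out = []
    for start in range(0, len(seq) - k + 1, stride):
        window = seq[start:start + k]
        out.append([sum(w * vec[c] for w, vec in zip(kernel, window))
                    for c in range(dim)])
    return out

# Hypothetical sizes, for illustration only.
EMBED_DIM = 4
rng = random.Random(0)
table = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(256)]

data = bytes(range(32))                      # stand-in for raw file contents
tokens = bytes_to_tokens(data)
embedded = embed(tokens, table)
down = strided_conv1d(embedded, kernel=[0.25] * 4, stride=2)

print(len(tokens), len(down))                # 32 15
```

The downsampled sequence is what a Transformer would then attend over. Shortening it early matters because file byte streams are far longer than the patch sequences a ViT sees.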

Paper Structure (34 sections, 6 figures, 5 tables)

This paper contains 34 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: ByteFormer vs. ViT. (a) A standard vision Transformer (ViT) decodes file bytes into an RGB image. The image is then split into patches, and patch embeddings are extracted and fed to a Transformer to obtain contextualized patch embeddings, which are classified using a linear classifier. (b) ByteFormer operates directly on file bytes.
  • Figure 2: (a-c): Illustration of the types of attention used in ablations. Bag attention is computed in two stages. First, individual bags compute attention. Then, attention is computed across bags. (d): ImageNet Top-1 accuracy of BF-Ti with different types of attention. We run out of memory with full attention.
  • Figure 3: $|x \cdot y| / (||x|| \cdot ||y||)$ for pairs $x, y$ of token embeddings (top row) and positional embeddings (bottom row) learned by BF-Ti. We show results for various file encodings on ImageNet (IN) and Speech Commands V2 (SC).
  • Figure 4: An overview of our byte remapping method and our masking camera method. (a): In our byte remapping method, we remap byte values using a permutation function before passing inputs to our model. (b): In our masking camera method, our model inputs are heavily masked images rasterized into a continuous array.
  • Figure 5: (a): A sample image from the ImageNet validation set, with uniform noise applied (top row), and with byte remapping $\phi$ additionally applied (bottom row). (b): ImageNet Top-1 results for obfuscation with $\phi$. We show results with no noise, and with uniform noise in $[-a, a]$ added. We use the fHWC encoding.
  • ...and 1 more figure
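Two quantities from the captions above are easy to make concrete: the similarity measure $|x \cdot y| / (||x|| \cdot ||y||)$ plotted in Figure 3, and the byte remapping $\phi$ from Figures 4 and 5, which the captions describe as a permutation of byte values applied before the input reaches the model. The sketch below is an illustration under assumptions. The function names are hypothetical, the permutation is drawn from a seeded shuffle rather than by whatever procedure the authors used, and the inverse mapping is included only to check that $\phi$ is a bijection.

```python
import math
import random

def abs_cosine_similarity(x, y):
    """Compute |x . y| / (||x|| * ||y||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return abs(dot) / (norm_x * norm_y)

def make_byte_permutation(seed):
    """Build a permutation phi of the 256 byte values (seeded shuffle; an assumption)."""
    perm = list(range(256))
    random.Random(seed).shuffle(perm)
    return perm

def remap_bytes(data, perm):
    """Apply the permutation to every byte of the input."""
    return bytes(perm[b] for b in data)

def invert_permutation(perm):
    """Return the inverse permutation, so remapping can be undone."""
    inverse = [0] * 256
    for src, dst in enumerate(perm):
        inverse[dst] = src
    return inverse

phi = make_byte_permutation(seed=0)
original = b"raw file bytes"
obfuscated = remap_bytes(original, phi)
restored = remap_bytes(obfuscated, invert_permutation(phi))

print(restored == original)                          # True
print(abs_cosine_similarity([1, 0], [-2, 0]))        # 1.0
```

Because of the absolute value, vectors that point in opposite directions score 1.0, the same as parallel ones. The measure therefore reports alignment regardless of sign.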