JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

Xiaochuang Han, Marjan Ghazvininejad, Pang Wei Koh, Yulia Tsvetkov

TL;DR

This paper proposes a simple paradigm for visual generation: train autoregressive transformers to output canonical codec streams directly, JPEG for images and AVC/H.264 for videos. Codec bytes are treated as discrete tokens, with the vocabulary expanded via BPE, and the authors train vanilla Llama-2 models (Jpeg-LM and Avc-LM) with no vision-specific modules. On image generation, Jpeg-LM outperforms pixel-based and vector quantization (VQ) baselines, with a 31% reduction in FID relative to the VQ baselines and particular strength on long-tail visual elements. The evaluation covers zero-shot, partially prompted, and unconditional generation, and Avc-LM serves as a proof of concept for video. The codec-centric approach lowers the barrier to unifying language and visual generation in multi-modal LLMs; efficiency, scaling, and safety remain open areas for future work.

Abstract

Recent work in image and video generation has been adopting the autoregressive LLM architecture due to its generality and potentially easy integration into multi-modal systems. The crux of applying autoregressive training in language generation to visual generation is discretization -- representing continuous data like images and videos as discrete tokens. Common methods of discretizing images and videos include modeling raw pixel values, which are prohibitively lengthy, or vector quantization, which requires convoluted pre-hoc training. In this work, we propose to directly model images and videos as compressed files saved on computers via canonical codecs (e.g., JPEG, AVC/H.264). Using the default Llama architecture without any vision-specific modifications, we pretrain JPEG-LM from scratch to generate images (and AVC-LM to generate videos as a proof of concept), by directly outputting compressed file bytes in JPEG and AVC formats. Evaluation of image generation shows that this simple and straightforward approach is more effective than pixel-based modeling and sophisticated vector quantization baselines (on which our method yields a 31% reduction in FID). Our analysis shows that JPEG-LM has an especial advantage over vector quantization models in generating long-tail visual elements. Overall, we show that using canonical codec representations can help lower the barriers between language generation and visual generation, facilitating future research on multi-modal language/image/video LLMs.
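
To make "compressed file bytes as discrete tokens" concrete, the following is a minimal sketch of byte-level tokenization followed by a few BPE merges, using only the Python standard library. It is an illustration under stated assumptions, not the authors' tokenizer: the toy payload, the number of merges, the convention that new ids start at 256, and all function names are hypothetical. The byte string is only framed with the JPEG start-of-image and end-of-image markers and is not a decodable image.

```python
from collections import Counter

SOI = bytes([0xFF, 0xD8])  # JPEG start-of-image marker
EOI = bytes([0xFF, 0xD9])  # JPEG end-of-image marker


def bytes_to_tokens(data):
    """Map each byte to a token id in 0..255 (the base vocabulary)."""
    return list(data)


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(tokens, num_merges):
    """Greedily learn up to `num_merges` merges (assumption: new ids start at 256)."""
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        new_id = 256 + step
        merges[pair] = new_id
        tokens = merge_pair(tokens, pair, new_id)
    return tokens, merges


def detokenize(tokens, merges):
    """Expand merged ids back into the original byte string."""
    inverse = {new_id: pair for pair, new_id in merges.items()}
    stack, out = list(reversed(tokens)), bytearray()
    while stack:
        t = stack.pop()
        if t in inverse:
            a, b = inverse[t]
            stack.append(b)  # pushed first so that `a` is expanded first
            stack.append(a)
        else:
            out.append(t)
    return bytes(out)


# Hypothetical payload between the markers; made up for illustration only.
toy_file = SOI + bytes([0x12, 0x34] * 4 + [0x56]) + EOI
base = bytes_to_tokens(toy_file)
compressed, merges = learn_bpe(base, num_merges=2)
```

The merged sequence is shorter than the raw byte sequence, and `detokenize(compressed, merges)` recovers the exact bytes, so a model that emits such tokens is emitting a file that a standard decoder could open directly. In a real setting the merges would be learned over a large corpus of encoded files rather than a single toy string.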

Paper Structure (29 sections, 11 figures, 2 tables)

This paper contains 29 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Jpeg-LM and Avc-LM are simple autoregressive transformers that directly model and generate canonical file encodings.
  • Figure 2: Generated images by Jpeg-LM and baselines with partial images as prompts. We show three random samples from Jpeg-LM and one each from the VQ transformer and ImageGPT (with super-resolution). The original images for the prompts are independently sourced outside existing training sets. We observe that Jpeg-LM can generate realistic facial expressions, landscapes, common objects, text rendered in images, etc. Additionally, Jpeg-LM shows a particular advantage over baselines on meaningful details like human eyes. Two appendix figures show more examples of Jpeg-LM and the VQ transformer on unconditional generation.
  • Figure 3: Compression effect of VQ and JPEG (zoom in for the best view). JPEG is significantly better in detailed but highly perceptible elements like small human faces and text characters. VQ has a relative advantage in color and sharpness preservation.
  • Figure 5: Generated video frames by Avc-LM on held-out test data. The first 10 frames are given to the model as the prompt, and the last 5 frames are generated by the model.
  • Figure 6: Unconditional generation by Jpeg-LM.
  • ...and 6 more figures