Large Language Model for Lossless Image Compression with Visual Prompts

Junhao Du, Chuqin Zhou, Ning Cao, Gang Chen, Yunuo Chen, Zhengxue Cheng, Li Song, Guo Lu, Wenjun Zhang

TL;DR

This paper tackles lossless image compression by bridging the gap between the textual priors of large language models (LLMs) and visual data. It introduces a visual-prompt framework in which a lossy reconstruction of the input image serves as the prompt, and the LLM models the distribution of the residual with a Gaussian Mixture Model, guided by global and local visual embeddings together with residual embeddings. The method, optionally fine-tuned with LoRA, achieves state-of-the-art performance on DIV2K, CLIC, and Kodak, and generalizes well to screen content and medical images. The results indicate that LLMs can serve as effective entropy estimators for lossless compression and suggest a new direction for image coding.

Abstract

Recent advancements in deep learning have driven significant progress in lossless image compression. With the emergence of Large Language Models (LLMs), preliminary attempts have been made to leverage the extensive prior knowledge embedded in these pretrained models to enhance lossless image compression, particularly by improving the entropy model. However, a significant challenge remains in bridging the gap between the textual prior knowledge within LLMs and lossless image compression. To tackle this challenge and unlock the potential of LLMs, this paper introduces a novel paradigm for lossless image compression that incorporates LLMs with visual prompts. Specifically, we first generate a lossy reconstruction of the input image as visual prompts, from which we extract features to serve as visual embeddings for the LLM. The residual between the original image and the lossy reconstruction is then fed into the LLM along with these visual embeddings, enabling the LLM to function as an entropy model to predict the probability distribution of the residual. Extensive experiments on multiple benchmark datasets demonstrate our method achieves state-of-the-art compression performance, surpassing both traditional and learning-based lossless image codecs. Furthermore, our approach can be easily extended to images from other domains, such as medical and screen content images, achieving impressive performance. These results highlight the potential of LLMs for lossless image compression and may inspire further research in related directions.
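
The residual step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `lossy_reconstruct` is a toy uniform quantizer standing in for the actual lossy codec, and the pixel list stands in for an image. The point is only that the lossy reconstruction plus an exactly coded residual recovers the original losslessly.

```python
import random

def lossy_reconstruct(pixels, step=16):
    """Toy stand-in for a lossy codec (an assumption, not the paper's codec):
    map every 8-bit pixel to the centre of its quantization bin."""
    return [(p // step) * step + step // 2 for p in pixels]

def residual(pixels, recon):
    """Residual r = x - x_l, the signal the entropy model has to describe."""
    return [p - q for p, q in zip(pixels, recon)]

def restore(recon, res):
    """Decoder side: x = x_l + r, which is exact because r is coded losslessly."""
    return [q + r for q, r in zip(recon, res)]

random.seed(0)
x = [random.randrange(256) for _ in range(64)]  # hypothetical 8-bit samples
x_l = lossy_reconstruct(x)
r = residual(x, x_l)
x_hat = restore(x_l, r)
```

With this toy quantizer the residual values fall in a narrow range around zero, which is what makes them cheaper to entropy-code than the raw pixels. A real lossy codec would give a differently shaped residual.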

Paper Structure

This paper contains 16 sections, 3 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Overview of the encoding and decoding process. A lossy reconstruction $\mathbf{x}_l$ and its patch $\mathbf{x}_l^n$ serve as visual prompts for the LLM to predict the residual's probability distribution, with the decoding process mirroring encoding by generating residual tokens autoregressively. The red dashed line represents the autoregressive process, where the decoded residuals serve as input to the LLM to predict the probability distribution of the next residual. This process continues until all residuals are decoded. (AE: Arithmetic Encoder. AD: Arithmetic Decoder. LLM: Large Language Model.)
  • Figure 2: Our LLM-based distribution estimation framework. Visual embeddings, comprising the global embeddings $\mathbf{z}_{g}$ and the local embeddings $\mathbf{z}_{l}^n$, enhance the inference. The LLM's output features $f^n$ are projected onto a Gaussian Mixture Model (GMM) to estimate the residual's probability distribution.
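
How a GMM can act as an entropy model for integer residuals, as in the Figure 2 caption, can be sketched as follows. This is a hedged illustration, not the paper's implementation. The mixture parameters are hand-picked rather than predicted by an LLM, and the discretization rule (integrating each Gaussian over a unit-width bin, with the edge bins absorbing the tails) is an assumption borrowed from common practice in learned compression. The names `gmm_pmf` and `code_length_bits` are hypothetical.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def gmm_pmf(weights, means, scales, lo=-255, hi=255):
    """Discretize a 1-D GMM onto the integers lo..hi.
    Each bin is [r - 0.5, r + 0.5]; the two edge bins absorb the tails,
    so the probabilities sum to 1 when the weights do."""
    pmf = []
    for r in range(lo, hi + 1):
        p = 0.0
        for w, mu, s in zip(weights, means, scales):
            upper = 1.0 if r == hi else normal_cdf(r + 0.5, mu, s)
            lower = 0.0 if r == lo else normal_cdf(r - 0.5, mu, s)
            p += w * (upper - lower)
        pmf.append(p)
    return pmf

def code_length_bits(pmf, r, lo=-255):
    """Ideal arithmetic-coding cost of the symbol r, in bits: -log2 P(r)."""
    return -math.log2(pmf[r - lo])

# Hand-picked mixture parameters (assumed; in the paper they would be
# derived from the LLM's output features).
weights, means, scales = [0.7, 0.3], [0.0, 4.0], [1.5, 3.0]
pmf = gmm_pmf(weights, means, scales)
bits_likely = code_length_bits(pmf, 0)     # residual near a mixture mean
bits_unlikely = code_length_bits(pmf, 10)  # residual further out in the tail
```

Residuals that the model considers likely cost fewer bits, so a sharper and better-centred prediction translates directly into a smaller file. Practical codecs usually also apply a small lower bound to the probabilities, since a bin that underflows to zero probability cannot be coded.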