
Exploring the Limits of Semantic Image Compression at Micro-bits per Pixel

Jordan Dotzel, Bahaa Kotb, James Dotzel, Mohamed Abdelfattah, Zhiru Zhang

TL;DR

This paper investigates the lower limits of semantic image compression by using GPT-4V as the encoder and DALL-E3 as the decoder, augmented with an iterative reflection mechanism to refine outputs. It demonstrates semantic compression near 100 $\mu$bpp at $1024\times1024$ resolution, yielding representations up to $10{,}000\times$ smaller than JPEG. The study highlights that semantic content grows sub-linearly with resolution and discusses practical benefits for large-scale data transmission in collaborative virtual environments. It also identifies limitations in object orientation, color consistency, and occasional hallucinations, suggesting that future gains will come from improved descriptive fidelity and independent editing capabilities.

Abstract

Traditional methods, such as JPEG, perform image compression by operating on structural information, such as pixel values or frequency content. These methods are effective at bitrates of around one bit per pixel (bpp) and higher at standard image sizes. In contrast, text-based semantic compression directly stores concepts and their relationships using natural language, which has evolved with humans to efficiently represent these salient concepts. These methods can operate at extremely low bitrates by disregarding structural information like location, size, and orientation. In this work, we use GPT-4V and DALL-E3 from OpenAI to explore the quality-compression frontier for image compression and identify the limitations of current technology. We push semantic compression as low as 100 $\mu$bpp (up to $10{,}000\times$ smaller than JPEG) by introducing an iterative reflection process to improve the decoded image. We further hypothesize that this 100 $\mu$bpp level represents a soft limit on semantic compression at standard image resolutions.
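To make the bitrates above concrete, the following is a minimal sketch of the arithmetic, not the authors' measurement code. It assumes a description is stored as raw UTF-8 text at 8 bits per byte with no further entropy coding; the helper names `bits_per_pixel` and `text_bpp` and the sample string are illustrative choices, not taken from the paper. At $1024\times1024$, a 100 $\mu$bpp budget works out to roughly 105 bits, or about 13 bytes of text, while a 1 bpp JPEG of the same size needs 131,072 bytes.

```python
def bits_per_pixel(payload_bytes: int, width: int, height: int) -> float:
    """Bitrate of a payload, in bits per pixel, for a width x height image."""
    return payload_bytes * 8 / (width * height)


def text_bpp(description: str, width: int, height: int) -> float:
    """Bitrate of a text description, assuming raw UTF-8 storage (an assumption)."""
    return bits_per_pixel(len(description.encode("utf-8")), width, height)


WIDTH = HEIGHT = 1024
pixels = WIDTH * HEIGHT

# Bit budget implied by 100 micro-bits per pixel at this resolution.
budget_bits = 100e-6 * pixels
budget_bytes = budget_bits / 8

# Reference point from the abstract: JPEG at roughly 1 bpp.
jpeg_bits = 1.0 * pixels
ratio = jpeg_bits / budget_bits

print(f"100 ubpp budget: {budget_bits:.1f} bits ({budget_bytes:.1f} bytes)")
print(f"JPEG at 1 bpp:   {jpeg_bits / 8:.0f} bytes, ratio {ratio:.0f}x")

# Illustrative only: a short proper noun fits well inside the budget.
sample = "Taj Mahal"
print(f"{sample!r}: {text_bpp(sample, WIDTH, HEIGHT) * 1e6:.1f} ubpp")
```

Under these assumptions the $10{,}000\times$ figure is simply the ratio between 1 bpp and 100 $\mu$bpp, and a nine-character landmark name lands in the tens of $\mu$bpp.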

Paper Structure (10 sections, 8 figures)

Figures (8)

  • Figure 1: Compression Regions: This work explores the limits of semantic compression with ChatGPT4 and demonstrates improvements through iterative reflection.
  • Figure 2: Method and Examples: ChatGPT4 can perform semantic compression at the 100 $\mu$bpp level, capturing only the most important concepts in the image with respect to human preferences.
  • Figure 3: Compression Examples: The first example shows the progressive loss of contextual information, including tile details, room color, the location of the man, and whether he is sitting or standing. The second example, on the other hand, shows that landmarks and proper nouns like the Taj Mahal, taken from standard angles, can be compressed to tens of $\mu$bpp, since a significant amount of information is captured within a few words. The third example again shows the gradual loss of context, color, gender, and location. The fourth example shows the progressive loss of contextual information, including light colors, figure position, and style of the lights. The fifth example shows that under heavy compression the model hyper-focuses on certain arbitrary details, like the flowers. Finally, the last example shows the loss of information about the jacket color and other details at higher compression.
  • Figure 4: More Compression Examples: These examples show the usefulness of image-specific, variable-rate compression, which uses fewer bits for more common images, and the gradual decline in quality at lower bitrates in most examples.
  • Figure 5: Bearded Man: An example at higher bitrates that demonstrates the effectiveness of reflection with sufficient context. Originally, the model produces two men and then corrects its mistake. It then regresses on the floral pattern but identifies the regression and adjusts appropriately. The result follows much of the detail in the uncompressed description.
  • ...and 3 more figures
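
The iterative reflection behavior described for Figure 5 (decode, compare against the original, revise the description) can be sketched as a generic loop. This is a hedged illustration, not the authors' implementation: the callables `encode`, `decode`, `critique`, and `revise` are hypothetical stand-ins for the roles GPT-4V and DALL-E3 play in the paper, no real API is called, and the byte-budget check is an assumption about how a bitrate target might be enforced. The toy functions at the bottom exist only to make the loop runnable.

```python
from typing import Any, Callable, Optional


def reflect(
    image: Any,
    encode: Callable[[Any], str],
    decode: Callable[[str], Any],
    critique: Callable[[Any, Any], Optional[str]],
    revise: Callable[[str, str], str],
    max_bytes: int,
    max_rounds: int = 3,
) -> str:
    """Refine a text description until the critic is satisfied or limits are hit."""
    description = encode(image)
    for _ in range(max_rounds):
        decoded = decode(description)
        feedback = critique(image, decoded)  # None means "no mismatch found"
        if feedback is None:
            break
        candidate = revise(description, feedback)
        # Assumed policy: never accept a revision that exceeds the byte budget.
        if len(candidate.encode("utf-8")) > max_bytes:
            break
        description = candidate
    return description


# Toy stand-ins: an "image" is just a list of concepts (purely illustrative).
def toy_encode(image):
    return image[0]


def toy_decode(description):
    return description.split(", ")


def toy_critique(image, decoded):
    missing = [concept for concept in image if concept not in decoded]
    return missing[0] if missing else None


def toy_revise(description, feedback):
    return f"{description}, {feedback}"


scene = ["man", "beard", "floral shirt"]
loose = reflect(scene, toy_encode, toy_decode, toy_critique, toy_revise, max_bytes=32)
tight = reflect(scene, toy_encode, toy_decode, toy_critique, toy_revise, max_bytes=12)
print(loose)
print(tight)
```

With the larger budget the toy loop recovers every concept; with the smaller one it stops as soon as a revision would exceed the budget, which mirrors the trade-off between bitrate and retained detail shown in the figures.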