Exploring the Limits of Semantic Image Compression at Micro-bits per Pixel
Jordan Dotzel, Bahaa Kotb, James Dotzel, Mohamed Abdelfattah, Zhiru Zhang
TL;DR
This paper investigates the lower limits of semantic image compression by using GPT-4V as the encoder and DALL-E3 as the decoder, augmented with an iterative reflection mechanism that refines the decoded output. It demonstrates semantic compression near 100 $μ$bpp at $1024\times1024$ resolution, with representations up to $10,000\times$ smaller than JPEG. The study highlights that semantic content scales sub-linearly with resolution and discusses practical benefits for large-scale data transmission in collaborative virtual environments. It also identifies limitations in object orientation, color consistency, and occasional hallucinations, suggesting that future gains will come from improved descriptive fidelity and independent editing capabilities.
Abstract
Traditional methods, such as JPEG, perform image compression by operating on structural information, such as pixel values or frequency content. These methods are effective at bitrates of around one bit per pixel (bpp) and higher at standard image sizes. In contrast, text-based semantic compression directly stores concepts and their relationships using natural language, which has evolved with humans to efficiently represent these salient concepts. Such semantic methods can operate at extremely low bitrates by disregarding structural information like location, size, and orientation. In this work, we use GPT-4V and DALL-E3 from OpenAI to explore the quality-compression frontier of image compression and identify the limitations of current technology. We push semantic compression as low as 100 $μ$bpp (up to $10,000\times$ smaller than JPEG) by introducing an iterative reflection process to improve the decoded image. We further hypothesize that this 100 $μ$bpp level represents a soft limit on semantic compression at standard image resolutions.
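
To make the quoted bitrates concrete, the following minimal sketch converts between bits per pixel and total payload size for a $1024\times1024$ image. The helper names are illustrative and not from the paper, and the 8 bits per character figure is an assumption about how the text description is stored; the paper's actual encoding may differ.

```python
# Illustrative arithmetic only: helper names are hypothetical, not from the paper.

def payload_bits(bpp: float, width: int, height: int) -> float:
    """Total number of bits needed to store an image at a given bitrate."""
    return bpp * width * height


def bits_per_pixel(total_bits: float, width: int, height: int) -> float:
    """Bitrate implied by a payload of total_bits for a width x height image."""
    return total_bits / (width * height)


WIDTH = HEIGHT = 1024
MICRO = 1e-6  # one micro-bit per pixel, expressed in bpp

# Semantic compression at the 100 micro-bpp level quoted in the abstract.
semantic_bits = payload_bits(100 * MICRO, WIDTH, HEIGHT)

# Traditional codec operating near one bit per pixel.
jpeg_bits = payload_bits(1.0, WIDTH, HEIGHT)

# Assumption: plain text stored at 8 bits per character.
BITS_PER_CHAR = 8
semantic_chars = semantic_bits / BITS_PER_CHAR

if __name__ == "__main__":
    print(f"semantic payload: {semantic_bits:.1f} bits "
          f"(~{semantic_chars:.0f} characters at 8 bits each)")
    print(f"1 bpp payload:    {jpeg_bits:.0f} bits")
    print(f"size ratio:       {jpeg_bits / semantic_bits:.0f}x")
```

Under these assumptions, 100 $μ$bpp at $1024\times1024$ corresponds to roughly 105 bits in total, about 13 characters of uncompressed 8-bit text, while 1 bpp corresponds to 1,048,576 bits, which is where the $10,000\times$ ratio comes from.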
