
CMC-Bench: Towards a New Paradigm of Visual Signal Compression

Chunyi Li, Xiele Wu, Haoning Wu, Donghui Feng, Zicheng Zhang, Guo Lu, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, Weisi Lin

TL;DR

This work introduces CMC-Bench, a benchmark for evaluating cross-modality image compression that couples I2T and T2I models to achieve ultra-low bitrate performance. It provides a large-scale dataset (58,000 images) and 160,000 expert subjective scores across four compression modes (Text, Pixel, Image, Full) to jointly assess consistency and perception. The study demonstrates that certain I2T+T2I combinations can outperform traditional codecs at very low bitrates, while outlining limitations and directions for improving model design and robustness across content types. By releasing ground-truth data, evaluation metrics, and baselines, CMC-Bench aims to accelerate the development of semantic-level visual codecs and invites broad participation from LMM developers.

Abstract

Ultra-low bitrate image compression is a challenging and demanding topic. With the development of Large Multimodal Models (LMMs), a Cross Modality Compression (CMC) paradigm of Image-Text-Image has emerged. Compared with traditional codecs, this semantic-level compression can reduce image data size to 0.1% or even lower, which has strong potential applications. However, CMC has certain defects in consistency with the original image and perceptual quality. To address this problem, we introduce CMC-Bench, a benchmark of the cooperative performance of Image-to-Text (I2T) and Text-to-Image (T2I) models for image compression. This benchmark covers 18,000 and 40,000 images respectively to verify 6 mainstream I2T and 12 T2I models, including 160,000 subjective preference scores annotated by human experts. At ultra-low bitrates, this paper proves that the combination of some I2T and T2I models has surpassed the most advanced visual signal codecs; meanwhile, it highlights where LMMs can be further optimized toward the compression task. We encourage LMM developers to participate in this test to promote the evolution of visual signal codec protocols.
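To give a feel for the "0.1% or even lower" figure, the following is a minimal sketch of the arithmetic for a text-only payload. It is not the benchmark's own measurement code: the function name `text_mode_stats`, the 1024x1024 resolution, the 8-bit RGB raw baseline, and the example caption are all assumptions chosen for illustration.

```python
def text_mode_stats(caption: str, width: int, height: int, channels: int = 3) -> dict:
    """Estimate payload size when only a caption is transmitted.

    Assumes the baseline is an uncompressed image with 8 bits per channel
    (an illustrative assumption, not necessarily the paper's baseline).
    """
    payload_bytes = len(caption.encode("utf-8"))  # bytes actually sent
    raw_bytes = width * height * channels         # uncompressed image size
    return {
        "payload_bytes": payload_bytes,
        "bpp": payload_bytes * 8 / (width * height),          # bits per pixel
        "size_ratio_percent": 100 * payload_bytes / raw_bytes,
    }


# Hypothetical caption, standing in for the output of an I2T model.
caption = (
    "A golden retriever lying on a wooden porch at sunset, "
    "warm light, shallow depth of field."
)
stats = text_mode_stats(caption, width=1024, height=1024)
print(stats)
```

Under these assumptions a caption of under a hundred bytes is far below 0.1% of the raw image size, which leaves headroom for any additional side information a codec may choose to transmit alongside the text.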

Paper Structure (22 sections, 2 equations, 19 figures, 6 tables)

Figures (19)

  • Figure 1: Overview of CMC-Bench. We demonstrate the superiority of Cross Modality Compression over traditional codecs, with subjective and objective evaluations of the compression results on Consistency and Perception. This benchmark can help CMC become the future codec paradigm.
  • Figure 2: Source data illustration of CMC-Bench from three content types.
  • Figure 3: Illustration of 4 working modes of CMC. Text mode roughly reconstructs the semantic information, Pixel mode slightly improves low-level consistency, Image mode provides a similar structure towards ground truth but a different character, and Full mode has the best performance.
  • Figure 4: A radar map illustrating the collaboration of mainstream I2T (left) and T2I (right) LMMs. The models are tested as {6 different I2Ts + RealVis} and {GPT-4o + 12 different T2Is}.
  • Figure 5: Illustration of subjective preference in terms of Mean Opinion Score.
  • ...and 14 more figures