Learned Image Compression with Text Quality Enhancement

Chih-Yu Lai, Dung Tran, Kazuhito Koishida

TL;DR

This paper tackles text distortion in screen-content image compression by introducing a plug-and-play text logit loss that measures textual fidelity using OCR-based logits from cropped text regions. The loss is integrated into the standard rate-distortion objective, enabling end-to-end training without architectural changes. Across two SCI datasets and five entropy-based codecs, the method yields consistent reductions in Character Error Rate and Word Error Rate at the same bitrate, with BD-CER and BD-WER demonstrating substantial text-quality gains. The findings support the feasibility of text-aware compression and offer practical guidance on balancing the text loss weight $\kappa$ to improve text reconstruction without excessive bitrate penalties.
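
As a hedged illustration of how such a term could enter training, one common way to write a rate-distortion objective with an added text term is sketched below. The symbols $\lambda$, $\kappa$, and $\mathscr{T}(x,\hat{x})$ appear in the figure captions of this page. The rate term $R$, the pixel distortion $D$ (for example MSE), and the exact placement and scaling of the weights are assumptions made for illustration, so the paper's own equation may differ.

```latex
% Assumed form: rate + weighted pixel distortion + weighted text logit loss.
% R and D are illustrative symbols; only \lambda, \kappa and \mathscr{T} come from the source.
\mathcal{L}(x,\hat{x}) \;=\; R(\hat{x}) \;+\; \lambda\, D(x,\hat{x}) \;+\; \kappa\, \mathscr{T}(x,\hat{x})
```

Under this assumed form, setting $\kappa = 0$ recovers an ordinary rate-distortion objective, which is the "without text loss" baseline that the figures compare against.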

Abstract

Learned image compression has gained widespread popularity for its efficiency in achieving ultra-low bit-rates. Yet images containing substantial textual content, particularly screen-content images (SCI), often suffer from text distortion at such compression levels. To address this, we propose to minimize a novel text logit loss designed to quantify the disparity in text between the original and reconstructed images, thereby improving the perceptual quality of the reconstructed text. Through rigorous experimentation across diverse datasets and state-of-the-art algorithms, we find significant improvements in the quality of reconstructed text when the proposed loss function is integrated with appropriate weighting. Notably, we achieve a Bjontegaard delta (BD) rate of -32.64% for Character Error Rate (CER) and -28.03% for Word Error Rate (WER) on average by applying the text logit loss on two screenshot datasets. Additionally, we present quantitative metrics tailored for evaluating text quality in image compression tasks. Our findings underscore the efficacy and potential applicability of the proposed text logit loss function across various text-aware image compression contexts.
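
For readers unfamiliar with the two metrics named above, here is a minimal Python sketch of CER and WER using the usual definition: Levenshtein edit distance divided by the reference length, counted over characters or words. The function names are illustrative, and the OCR system and text normalization used in the paper's evaluation are not specified here, so this shows only the metric itself.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match if equal
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

In an evaluation of this kind, `reference` would be the text recognized in (or annotated for) the original image and `hypothesis` the text recognized in the reconstruction. A single misread character in an 11-character word gives a CER of 1/11 but counts as one whole wrong word for WER.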

Paper Structure

This paper contains 11 sections, 7 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: High-level training scheme for text-aware image compression. The coordinates of all text regions in the original image are extracted as $\mathscr{B}$ and used to crop the original ($x$) and reconstructed ($\hat{x}$) images into two lists of cropped text regions, $\{b_1,\dots,b_n\}$ and $\{\hat{b}_1,\dots,\hat{b}_n\}$. A text recognizer is applied to each crop to obtain lists of logits $\{v_1,\dots,v_n\}$ and $\{\hat{v}_1,\dots,\hat{v}_n\}$, which are then compared to calculate the text logit loss ($\mathscr{T}(x,\hat{x})$). During backpropagation, the gradient of this loss updates the weights of the compression model.
  • Figure 2: Character error rate (CER) vs. bits per pixel (BPP) and word error rate (WER) vs. BPP for the CIRCL and Website Screenshot datasets, shown with and without the text logit loss. $\kappa = 0.1$ for all experiments. At the same BPP, CER and WER are generally lower with the text logit loss than without it, and the difference is more pronounced at lower BPP.
  • Figure 3: Original and reconstructed images from five screenshots in the WebScreenshots dataset. 'w/o $\mathscr{T}(x,\hat{x})$' denotes training without the text logit loss, while 'w/ $\mathscr{T}(x,\hat{x})$' denotes training with it. Perceptually, text reconstructed without the text logit loss appears blurrier and more distorted. For example, in the bottom-left panel (Ballé 2018 with text logit loss), the text on the bottom line is easily readable, whereas in the center-left panel (Ballé 2018 without text logit loss), it is harder to read.
  • Figure 4: CER and WER vs. BPP for the Website Screenshot dataset. Results with the text loss use one specific $\lambda$ and $\kappa \in \{0.001, 0.01, 0.1, 1, 10\}$; results without the text loss use $\lambda \in \{0.0001, 0.0002, 0.0004, 0.0007, 0.001, 0.002, 0.004, 0.007, 0.01\}$. Starting from $\kappa = 0.001$, CER and WER drop more sharply than they do when $\kappa = 0$ is kept and $\lambda$ is increased, indicating that a small $\kappa$ is an effective way to improve text quality. When $\kappa$ is too large, CER and WER become excessively high, which is undesirable.
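
The training scheme described in Figure 1 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: `toy_recognizer` is a hypothetical stand-in for an OCR model, the box format `(top, left, bottom, right)` is assumed, and the mean squared difference between logits is an assumed distance, since the captions do not say which comparison is used. Images are plain nested lists so the example runs without dependencies.

```python
def crop(image, box):
    """Return the region inside box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def toy_recognizer(region):
    """Hypothetical stand-in for an OCR model: one 'logit' per column (its mean)."""
    height = len(region)
    return [sum(row[c] for row in region) / height for c in range(len(region[0]))]


def text_logit_loss(x, x_hat, boxes, recognizer=toy_recognizer):
    """Compare logits of matching text crops from x and x_hat.

    Assumed distance: mean squared difference over all logits of all boxes.
    """
    total, count = 0.0, 0
    for box in boxes:
        v = recognizer(crop(x, box))          # logits for the original crop b_i
        v_hat = recognizer(crop(x_hat, box))  # logits for the reconstructed crop
        total += sum((a - b) ** 2 for a, b in zip(v, v_hat))
        count += len(v)
    return total / count if count else 0.0


# Tiny 4x4 "image" with one 2x2 text region; the reconstruction dims one pixel.
x = [[0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
x_hat = [row[:] for row in x]
x_hat[1][1] = 0.5
boxes = [(1, 1, 3, 3)]
loss = text_logit_loss(x, x_hat, boxes)
```

In actual training, the recognizer and the crop would have to be differentiable operations in an autodiff framework so that the gradient of the loss can reach the compression model's weights. The pure-Python version above only shows the data flow from boxes to crops to logits to a scalar loss.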