TextBoost: Boosting Scene Text Fidelity in Ultra-low Bitrate Image Compression

Bingxin Wang, Yuan Lan, Zhaoyi Sun, Yang Xiang, Jie Sun

TL;DR

This work transmits OCR-extracted text as auxiliary information at negligible bitrate overhead. The decoder uses this semantic guidance to produce sharper small-font text while preserving global image quality, effectively decoupling text enhancement from global rate-distortion optimization.

Abstract

Ultra-low bitrate image compression faces a critical challenge: preserving small-font scene text while maintaining overall visual quality. Region-of-interest (ROI) bit allocation can prioritize text but often degrades global fidelity, leading to a trade-off between local accuracy and overall image quality. Instead of relying on ROI coding, we incorporate auxiliary textual information extracted by OCR and transmitted with negligible overhead, enabling the decoder to leverage this semantic guidance. Our method, TextBoost, operationalizes this idea through three strategic designs: (i) adaptively filtering OCR outputs and rendering them into a guidance map; (ii) integrating this guidance with decoder features in a calibrated manner via an attention-guided fusion block; and (iii) enforcing guidance-consistent reconstruction in text regions with a regularizing loss that promotes natural blending with the scene. Extensive experiments on TextOCR and ICDAR 2015 demonstrate that TextBoost yields up to 60.6% higher text-recognition F1 at comparable Peak Signal-to-Noise Ratio (PSNR) and bits per pixel (bpp), producing sharper small-font text while preserving global image quality and effectively decoupling text enhancement from global rate-distortion optimization.
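
Design (i), adaptive filtering of OCR outputs, can be illustrated with a small sketch. The figure captions mention an average character-area criterion that drops large-font text; everything else here is an assumption made for illustration: the `OcrWord` record, the definition of average character area as box area divided by character count, and the `max_char_area` threshold value.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """Hypothetical OCR record: recognized string plus box size in pixels."""
    text: str
    width: float
    height: float


def avg_char_area(word):
    """Box area divided by the number of characters (assumed definition)."""
    return (word.width * word.height) / max(len(word.text), 1)


def select_small_text(words, max_char_area):
    """Keep non-empty words whose average character area is at most the threshold.

    Large-font text tends to survive compression without help, so only
    small-font words are kept for transmission as guidance.
    """
    return [w for w in words if w.text and avg_char_area(w) <= max_char_area]


words = [
    OcrWord("EXIT", width=400, height=120),     # 12000 px^2 per character
    OcrWord("open 9-5", width=64, height=12),   # 96 px^2 per character
]
kept = select_small_text(words, max_char_area=200.0)  # threshold is illustrative
print([w.text for w in kept])  # ['open 9-5']
```

In this sketch only the small sign text is kept, which mirrors the selective-transmission behaviour described for the guidance maps.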

Paper Structure

This paper contains 27 sections, 8 equations, 9 figures, and 1 table.

Figures (9)

  • Figure 1: TextBoost preserves small-text fidelity at ultra-low bitrates. Visual comparison on the TextOCR validation set [singh2021textocr] against ELIC [he2022elic], TACO [lee2024neural], and LIC-TCM [liu2023learned]. Our method delivers clearly better text reconstruction at similar or lower bitrates, with improved preservation of fine typographic details.
  • Figure 2: Overall pipeline of TextBoost. The framework is built upon a learned image compression backbone (comprising the Image Encoder, Hyperprior network, and Image Decoder). In parallel, the text branch extracts and transmits OCR information, which is processed by the Rendering-and-Alignment module to generate a visual guidance map. Finally, the Fusion Block integrates this guidance with the features decoded by the baseline network to produce the final output. SCCTX refers to the space-channel contextual model.
  • Figure 3: Examples of guidance maps generated from OCR information on the TextOCR dataset [singh2021textocr]. The first row shows the original scene images and the second row shows the corresponding rendering-and-alignment results. (1) and (3) illustrate robust handling of diverse in-plane orientations. By internally normalizing text to a horizontal layout, our method accurately renders text at distinct angles (e.g., vertical in (1) and slanted in (3)). (2) and (4) demonstrate selective transmission and precise rendering: large-font text is filtered out according to the average character-area criterion, while small-font content is rendered with accurate geometry and spatial placement.
  • Figure 4: Fusion block architecture and intermediate visualizations. Top: the proposed fusion block couples the auxiliary guidance with decoder features through element-wise modulation, channel expansion and concatenation, an attention module, and a final 1×1 projection to RGB. Bottom: visualizations aligned with each stage. (a) Hadamard product map: element-wise multiplication where white glyphs in the auxiliary map inherit color from the decoder features. (b) Concatenated features (13 + 3): the first 13 channels (blue branch) are projected from the decoder output, while the last 3 channels (green branch) correspond to the Hadamard product map. (c) Attention heatmap: activations of the attention block, mostly concentrated on small-font text areas. (d) Reconstruction: the final image after the 1×1 projection.
  • Figure 5: Quantitative evaluation of text spotting performance. Rate-distortion curves showing Text Detection (DET) and End-to-End Recognition (E2E) F-measures on TextOCR [singh2021textocr] and ICDAR 2015 [karatzas2015icdar]. TextBoost (red curves) yields significant improvements over state-of-the-art learned compression methods (e.g., LIC-TCM, ELIC) and ROI-based baselines, demonstrating superior text preservation capabilities at ultra-low bitrates.
  • ...and 4 more figures
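
The data flow in the Figure 4 caption (Hadamard product, expansion to 13 channels, concatenation to 13 + 3 = 16 channels, attention, 1×1 projection to RGB) can be sketched in NumPy. This is only a shape-level illustration under stated assumptions: the decoder output is taken to be a 3-channel map, the guidance map a single-channel map in [0, 1], the weights are random, and the attention module is replaced by a simple sigmoid spatial gate because the caption does not specify its design. The names `fusion_block`, `init_params`, and `conv1x1` are hypothetical.

```python
import numpy as np


def conv1x1(x, weight):
    """1x1 convolution on a (C_in, H, W) array: a per-pixel linear map."""
    return np.einsum("oc,chw->ohw", weight, x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(rng):
    """Random stand-in weights; a trained model would learn these."""
    return {
        "w_expand": rng.standard_normal((13, 3)) * 0.1,  # 3 -> 13 channels
        "w_attn": rng.standard_normal((1, 16)) * 0.1,    # 16 -> 1 gate map
        "w_out": rng.standard_normal((3, 16)) * 0.1,     # 16 -> RGB
    }


def fusion_block(decoded, guidance, params):
    """Fuse decoder output (3, H, W) with a guidance map (1, H, W)."""
    # (a) Hadamard product: white glyphs (value 1) inherit the decoder's color.
    hadamard = decoded * guidance                          # (3, H, W)
    # (b) Expand decoder output to 13 channels, then concatenate: 13 + 3 = 16.
    expanded = conv1x1(decoded, params["w_expand"])        # (13, H, W)
    fused = np.concatenate([expanded, hadamard], axis=0)   # (16, H, W)
    # (c) Attention stand-in (assumption): a sigmoid spatial gate.
    attn = sigmoid(conv1x1(fused, params["w_attn"]))       # (1, H, W)
    # (d) Final 1x1 projection of the gated features back to RGB.
    return conv1x1(fused * attn, params["w_out"]), attn


rng = np.random.default_rng(0)
decoded = rng.random((3, 8, 8))
guidance = np.zeros((1, 8, 8))
guidance[:, 2:4, 1:6] = 1.0  # a toy "glyph" region
out, attn = fusion_block(decoded, guidance, init_params(rng))
print(out.shape, attn.shape)  # (3, 8, 8) (1, 8, 8)
```

Outside the glyph region the Hadamard channels are zero, so in this sketch the guidance can only influence pixels where text was rendered, apart from what the gate and projection learn from the expanded decoder channels.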