Hybrid CNN-ViT Framework for Motion-Blurred Scene Text Restoration

Umar Rashid, Muhammad Arslan Arshad, Ghulam Ahmad, Muhammad Zeeshan Anjum, Rizwan Khan, Muhammad Akmal

TL;DR

This work tackles motion blur in scene-text images, a problem that degrades readability and undermines downstream vision tasks. It introduces a CNN–ViT hybrid architecture that couples a CNN encoder–decoder, which captures local textures, with a Vision Transformer, which captures global context. Trained on a TextOCR-derived dataset with realistic motion-blur kernels, the model uses a composite loss combining MAE, MSE, perceptual, and SSIM terms to balance pixel accuracy, texture, and structure. The approach reaches 32.20 dB PSNR and 0.934 SSIM with a compact 2.83M-parameter model and fast inference (~61 ms), which makes it practical for real-world, text-centric restoration in constrained environments.

Abstract

Motion blur in scene-text images severely impairs readability and reduces the reliability of computer vision applications, including autonomous driving, document digitization, and visual information retrieval. Conventional deblurring approaches are often inadequate in handling spatially varying blur and typically fall short in modeling the long-range dependencies necessary for restoring textual clarity. To overcome these limitations, we introduce a hybrid deep learning framework that combines convolutional neural networks (CNNs) with vision transformers (ViTs), thereby leveraging both local feature extraction and global contextual reasoning. The architecture employs a CNN-based encoder-decoder to preserve structural details, while a transformer module enhances global awareness through self-attention. Training is conducted on a curated dataset derived from TextOCR, where sharp scene-text samples are paired with synthetically blurred versions generated using realistic motion-blur kernels of multiple sizes and orientations. Model optimization is guided by a composite loss that incorporates mean absolute error (MAE), mean squared error (MSE), perceptual similarity, and structural similarity (SSIM). Quantitative evaluations show that the proposed method attains 32.20 dB in PSNR and 0.934 in SSIM, while remaining lightweight with 2.83 million parameters and an average inference time of 61 ms. These results highlight the effectiveness and computational efficiency of the CNN-ViT hybrid design, establishing its practicality for real-world motion-blurred scene-text restoration.
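
To make the composite loss and the reported PSNR metric more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The weights, the function names (`global_ssim`, `psnr`, `composite_loss`), and the optional `perceptual_fn` hook are all illustrative assumptions; the abstract names the four loss terms but does not give their weights or the perceptual backbone. The SSIM here is a single-window simplification of the usual sliding-window SSIM.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM computed over the whole image.

    This is a simplification: standard SSIM averages the same
    expression over local sliding windows.
    """
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return num / den


def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)


def composite_loss(pred, target, perceptual_fn=None,
                   weights=(1.0, 1.0, 0.1, 1.0)):
    """Weighted sum of MAE, MSE, perceptual, and (1 - SSIM) terms.

    `weights` are hypothetical placeholders, not values from the paper.
    `perceptual_fn` is a hypothetical hook for a feature-space distance
    (for example, one computed with a pretrained network); when it is
    None the perceptual term is skipped.
    """
    w_mae, w_mse, w_perc, w_ssim = weights
    mae = np.mean(np.abs(pred - target))
    mse = np.mean((pred - target) ** 2)
    perc = perceptual_fn(pred, target) if perceptual_fn is not None else 0.0
    ssim_term = 1.0 - global_ssim(pred, target)
    return w_mae * mae + w_mse * mse + w_perc * perc + w_ssim * ssim_term
```

In this sketch the pixel terms (MAE, MSE) penalize intensity errors, the SSIM term penalizes structural mismatch, and the perceptual hook stands in for the texture-oriented term. A perfectly restored image gives a loss of zero, and lower MSE corresponds directly to higher PSNR.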

Paper Structure

This paper contains 20 sections, 4 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Architecture of the proposed deblurring model combining convolutional and transformer layers.
  • Figure 2: Illustration of a representative motion blur kernel (a) and its corresponding frequency-domain representation (b), computed via Fourier transform. The spectrum highlights the directional and non-uniform characteristics of the applied motion blur, consistent with real-world motion patterns.
  • Figure 3: Training and Validation Loss curves across 250 epochs.
  • Figure 4: PSNR evolution for training and validation sets.
  • Figure 5: Feature maps of early encoder layers (enc_conv1 to enc_conv3), capturing low-level features.
  • ...and 3 more figures
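
The kernel and spectrum pair described in the Figure 2 caption can be illustrated with a short NumPy sketch. It assumes a simple straight-line (linear) motion kernel parameterized by size and angle; the paper's actual kernel generator is not described in this section, so `linear_motion_kernel`, `magnitude_spectrum`, and the 64-point FFT size are illustrative choices only.

```python
import numpy as np


def linear_motion_kernel(size, angle_deg):
    """Build a normalized straight-line motion-blur kernel.

    `size` must be odd so the line passes through a centre pixel.
    The angle is measured counter-clockwise from the horizontal axis.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd")
    kernel = np.zeros((size, size), dtype=np.float64)
    centre = size // 2
    theta = np.deg2rad(angle_deg)
    # Sample densely along the line so that no pixel on it is skipped.
    for t in np.linspace(-centre, centre, 4 * size):
        col = int(round(centre + t * np.cos(theta)))
        row = int(round(centre - t * np.sin(theta)))  # rows grow downward
        kernel[row, col] = 1.0
    return kernel / kernel.sum()  # weights sum to 1, preserving brightness


def magnitude_spectrum(kernel, fft_size=64):
    """Zero-pad the kernel, take a 2-D FFT, and centre the DC component."""
    spectrum = np.fft.fft2(kernel, s=(fft_size, fft_size))
    return np.abs(np.fft.fftshift(spectrum))


# Kernels of multiple sizes and orientations, as in the training setup.
kernels = [linear_motion_kernel(s, a) for s in (5, 9, 15) for a in (0, 45, 90)]
spectrum = magnitude_spectrum(linear_motion_kernel(9, 0))
```

For a horizontal kernel, the spectrum varies only along the horizontal frequency axis and is constant along the vertical one, which is the directional pattern the caption refers to. Rotating the kernel rotates that pattern accordingly.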