STR-Cert: Robustness Certification for Deep Text Recognition on Deep Learning Pipelines and Vision Transformers

Daqian Shao, Lukas Fesser, Marta Kwiatkowska

TL;DR

STR-Cert tackles robustness certification for scene text recognition, a challenging image-based sequence prediction task, including Vision Transformer pipelines. It extends the DeepPoly polyhedral framework with novel bounds for components such as TPS, patch embedding, and the CTC/Softmax layers to certify STR models under $L_\infty$ perturbations. The method certifies three STR architectures (CTC, attention, and ViTSTR) across six STR benchmarks, revealing scalability advantages of ViTSTR over LSTM-based decoders, especially for longer sequences. The work demonstrates practical safety guarantees for STR systems and points to future work in rotation robustness, branch-and-bound enhancements, and other perturbation norms.

Abstract

Robustness certification, which aims to formally certify the predictions of neural networks against adversarial inputs, has become an important tool for safety-critical applications. Despite considerable progress, existing certification methods are limited to elementary architectures, such as convolutional networks, recurrent networks and recently Transformers, on benchmark datasets such as MNIST. In this paper, we focus on the robustness certification of scene text recognition (STR), which is a complex and extensively deployed image-based sequence prediction problem. We tackle three types of STR model architectures, including the standard STR pipelines and the Vision Transformer. We propose STR-Cert, the first certification method for STR models, by significantly extending the DeepPoly polyhedral verification framework via deriving novel polyhedral bounds and algorithms for key STR model components. Finally, we certify and compare STR models on six datasets, demonstrating the efficiency and scalability of robustness certification, particularly for the Vision Transformer.
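
To make concrete what certifying a prediction under an $L_\infty$ perturbation means, the following is a minimal sketch that uses plain interval bounds rather than the DeepPoly polyhedral bounds the paper builds on. Interval bounds are a simpler and looser relaxation, swapped in here only for illustration. The function names (`affine_bounds`, `relu_bounds`, `certify`) and the two-layer toy network are hypothetical and are not taken from STR-Cert.

```python
# Illustrative sketch only: interval bound propagation on a toy ReLU network.
# STR-Cert itself uses tighter DeepPoly-style polyhedral bounds; the names and
# the network below are assumptions made for this example.

def affine_bounds(lo, hi, weights, bias):
    """Propagate elementwise bounds lo <= x <= hi through y = W x + b."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        # A positive weight attains its minimum at lo and its maximum at hi;
        # a negative weight does the opposite.
        l = b + sum(w * (l_i if w >= 0 else h_i) for w, l_i, h_i in zip(row, lo, hi))
        h = b + sum(w * (h_i if w >= 0 else l_i) for w, l_i, h_i in zip(row, lo, hi))
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi


def relu_bounds(lo, hi):
    """ReLU is monotone, so it can be applied to both bounds directly."""
    return [max(0.0, l) for l in lo], [max(0.0, h) for h in hi]


def certify(x, eps, layers, label):
    """Return True if every input within L_inf distance eps of x is provably
    classified as `label`. A False result is inconclusive, because the
    bounds are an over-approximation."""
    lo = [v - eps for v in x]
    hi = [v + eps for v in x]
    for i, (weights, bias) in enumerate(layers):
        lo, hi = affine_bounds(lo, hi, weights, bias)
        if i < len(layers) - 1:  # no activation after the final (logit) layer
            lo, hi = relu_bounds(lo, hi)
    # Certified if the lower bound of the true logit beats every other upper bound.
    return all(lo[label] > hi[j] for j in range(len(lo)) if j != label)


# Hypothetical 2-input, 2-class toy network.
toy_layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]
small_eps_certified = certify([1.0, 0.0], 0.1, toy_layers, label=0)
large_eps_certified = certify([1.0, 0.0], 0.6, toy_layers, label=0)
```

In this toy setting the small perturbation budget is certified and the large one is not. A sequence model such as an STR pipeline would need the same kind of check for every predicted character, which is part of why tighter bounds and scalability matter.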

Paper Structure (40 sections, 41 equations, 5 figures, 6 tables, 1 algorithm)

Figures (5)

  • Figure 1: An adversarial example from the IC15 dataset, where the predicted text changes under small $L_\infty$ perturbations. STR-Cert also shows that the original image is not robust.
  • Figure 2: Three types of common STR model architectures we consider in this work.
  • Figure 3: Polyhedral bounds for $f(1-|p_{ix}-m|)$ against $p_{ix}$ in the bilinear map of TPS.
  • Figure 4: Certification results with analysis against text length and adversarial training strength.
  • Figure 5: Example images and certification results for Softmax bounds and prediction confidence.