Efficient Multi-domain Text Recognition Deep Neural Network Parameterization with Residual Adapters
Jiayou Chao, Wei Zhu
TL;DR
The paper tackles OCR across multiple domains under data and compute constraints. It introduces a parameter-efficient multi-domain architecture that places residual adapters in the CNN backbone and bottleneck adapters in a transformer-based sequence model. The backbone is kept fixed while the adapters are trained for domain-specific refinement with Connectionist Temporal Classification (CTC) loss, reducing parameters by around 64.9% while keeping performance competitive on both simple and challenging domains. Key contributions include a practical framework for rapid domain adaptation, continual learning without forgetting, and an extensive evaluation on a Chinese multi-domain OCR benchmark that demonstrates scalability and robustness. The approach is well suited to deploying adaptable OCR systems in real-world, resource-constrained settings where multiple domains must be handled efficiently.
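To make the adapter idea concrete, the following is a minimal pure-Python sketch of a residual bottleneck adapter: a small down-projection, a nonlinearity, an up-projection, and a skip connection around them. It is an illustration under stated assumptions, not the authors' implementation. The class name `BottleneckAdapter`, the dimensions, the omission of bias terms, the ReLU nonlinearity, and the zero initialization of the up-projection (a common convention that makes a new adapter start as the identity map) are all assumptions not taken from the paper.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


class BottleneckAdapter:
    """Hypothetical residual bottleneck adapter: y = x + W_up * relu(W_down * x)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection (bottleneck x dim), small random init (assumption).
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection (dim x bottleneck), zero init (assumption): the adapter
        # then leaves the frozen backbone's features unchanged before training.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        h = relu(matvec(self.w_down, x))   # compress to the bottleneck width
        delta = matvec(self.w_up, h)       # expand back to the feature width
        return [xi + di for xi, di in zip(x, delta)]  # residual connection

    def num_params(self):
        # Only these weights would be trained; the backbone stays fixed.
        return (len(self.w_down) * len(self.w_down[0])
                + len(self.w_up) * len(self.w_up[0]))


adapter = BottleneckAdapter(dim=8, bottleneck=2)
x = [float(i) for i in range(8)]
y = adapter(x)
```

With the zero-initialized up-projection, `y` equals `x`, so adding an adapter for a new domain does not disturb the frozen backbone until that adapter is trained. The adapter here holds 2 × 8 × 2 = 32 weights, which is the only part that would need updating per domain.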
Abstract
Recent advancements in deep neural networks have markedly improved performance on computer vision tasks, yet the specialized nature of these networks often demands extensive data and high computational power. To address these requirements, this study presents a novel neural network model for optical character recognition (OCR) across diverse domains, leveraging multi-task learning to improve efficiency and generalization. The model is designed to adapt rapidly to new domains, remain compact enough to reduce computational resource demand, ensure high accuracy, retain knowledge from previous learning experiences, and allow domain-specific performance improvements without retraining the entire model. Rigorous evaluation on open datasets has validated the model's ability to significantly lower the number of trainable parameters without sacrificing performance, indicating its potential as a scalable and adaptable solution in computer vision, particularly for optical text recognition.
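The parameter savings from sharing one backbone across domains can be illustrated with simple arithmetic. The sketch below compares training a separate full model per domain against one shared, fixed backbone plus a small set of adapter weights per domain. All sizes (`BACKBONE`, `ADAPTER`, `DOMAINS`) and function names are hypothetical values chosen for illustration; they are not the paper's model sizes and do not reproduce its reported figures.

```python
def separate_models_params(backbone_params, num_domains):
    """Total parameters when every domain gets its own full model."""
    return backbone_params * num_domains


def shared_backbone_params(backbone_params, adapter_params, num_domains):
    """Total parameters with one shared backbone plus per-domain adapters."""
    return backbone_params + adapter_params * num_domains


def trainable_fraction(backbone_params, adapter_params):
    """Share of one domain's model that is trained when the backbone is fixed."""
    return adapter_params / (backbone_params + adapter_params)


# Hypothetical sizes, for illustration only.
BACKBONE = 10_000_000   # parameters in the shared, fixed backbone
ADAPTER = 500_000       # adapter parameters added per domain
DOMAINS = 4

separate = separate_models_params(BACKBONE, DOMAINS)
shared = shared_backbone_params(BACKBONE, ADAPTER, DOMAINS)
saving = 1 - shared / separate
fraction = trainable_fraction(BACKBONE, ADAPTER)

print(f"separate models: {separate:,} parameters")
print(f"shared backbone + adapters: {shared:,} parameters")
print(f"total parameter saving: {saving:.1%}")
print(f"trainable share for a new domain: {fraction:.1%}")
```

Under these assumed sizes, the shared design stores 12,000,000 parameters instead of 40,000,000, and adding a new domain trains only the adapter weights. The saving grows with the number of domains because the backbone cost is paid once.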
