Ensemble Learning for Vietnamese Scene Text Spotting in Urban Environments

Hieu Nguyen, Cong-Hoang Ta, Phuong-Thuy Le-Nguyen, Minh-Triet Tran, Trung-Nghia Le

TL;DR

This work tackles Vietnamese scene text spotting in urban environments by proposing an ensemble learning framework that fuses outputs from diverse detection and recognition models. The architecture comprises a data converter, multiple base models, and a meta-model that merges predictions through non-overlapping and overlapping text box handling using IoU-based rules. Experiments on the VinText benchmark show that carefully paired ensembles can outperform individual models, achieving performance gains up to several percentage points in key metrics, while also revealing trade-offs in computational cost. The findings demonstrate the practical potential of ensemble methods for robust Vietnamese text spotting in real-world urban scenes, with future work aimed at improving spelling accuracy and reducing computation.
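The merging step can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes axis-aligned boxes (VinText annotations are quadrilaterals), a hypothetical `TextBox` structure, an IoU threshold of 0.5, and a "keep the higher-confidence prediction" rule for overlapping boxes. The paper's actual IoU-based rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBox:
    """Hypothetical prediction record: axis-aligned box, text, confidence."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str
    score: float


def iou(a: TextBox, b: TextBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_predictions(primary, secondary, iou_threshold=0.5):
    """Fuse two models' predictions for one image.

    Assumed rules (illustrative only):
    - non-overlapping case: a secondary box whose best IoU with every kept
      box is below the threshold is added as a new detection;
    - overlapping case: the box with the higher confidence score is kept.
    """
    merged = list(primary)
    for cand in secondary:
        # Find the already-kept box that overlaps this candidate the most.
        best_idx, best_iou = -1, 0.0
        for i, kept in enumerate(merged):
            overlap = iou(kept, cand)
            if overlap > best_iou:
                best_idx, best_iou = i, overlap
        if best_iou >= iou_threshold:
            # Overlapping: treat both as the same text instance.
            if cand.score > merged[best_idx].score:
                merged[best_idx] = cand
        else:
            # Non-overlapping: the candidate is a detection the other model missed.
            merged.append(cand)
    return merged


model_a = [TextBox(0, 0, 10, 10, "PHO", 0.90)]
model_b = [
    TextBox(1, 1, 11, 11, "PHỞ", 0.95),      # overlaps model_a's box
    TextBox(50, 50, 60, 60, "CAFE", 0.80),   # found only by model_b
]
fused = merge_predictions(model_a, model_b)
```

In this toy input the two overlapping boxes collapse into the higher-confidence reading, and the box found by only one model is carried over, which is how an ensemble can recover detections that an individual model misses.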

Abstract

This paper presents a simple yet efficient ensemble learning framework for Vietnamese scene text spotting. Ensemble learning combines multiple models to yield more accurate predictions, and our approach uses it to improve scene text spotting in challenging urban settings. In experimental evaluations on the VinText dataset, our proposed method achieves an accuracy improvement of 5% over existing methods. These results demonstrate the efficacy of ensemble learning for Vietnamese scene text spotting in urban environments and highlight its potential for real-world applications such as text detection and recognition in urban signage, advertisements, and other text-rich urban scenes.

Paper Structure (17 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Scene text spotting in Vietnamese urban environments poses various challenges, such as text that is obscured by trees or distorted by perspective.
  • Figure 2: Workflow of the proposed ensemble learning framework for Vietnamese scene text spotting.
  • Figure 3: Visualization of results of our ensemble learning framework.