
Bengali Document Layout Analysis -- A YOLOV8 Based Ensembling Approach

Nazmus Sakib Ahmed, Saad Sakib Noor, Ashraful Islam Shanto Sikder, Abhijit Paul

TL;DR

This work tackles Bengali Document Layout Analysis (DLA) with a YOLOv8m-seg-based ensemble and post-processing designed for the multi-class segmentation challenges of the BaDLAD dataset. It introduces a two-model setup (a general model for all classes and a specialized image model), combined with convex hull-based post-processing to fill gaps in predicted masks and a memory-aware inference strategy to avoid CUDA out-of-memory errors. Through extensive validation-driven tuning and two-stage inference, the approach yields notable gains in Dice scores over the individual architectures, with improved segmentation of paragraphs, text boxes, images, and tables that better prepares documents for Bengali OCR. The result is a practical, robust, and efficient Bengali DLA pipeline with strong potential for real-time document understanding and broader Bengali language processing tasks.
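The convex hull-based post-processing and the Dice metric mentioned above can be illustrated with a small sketch. This is not the authors' implementation: it assumes each predicted instance is a binary NumPy mask and that "filling a gap" means replacing the mask with the filled convex hull of its foreground pixels. The helper names `convex_hull`, `hull_fill`, and `dice` are hypothetical; the hull uses Andrew's monotone chain, and Dice is the standard 2|A ∩ B| / (|A| + |B|).

```python
import numpy as np


def _cross(o, a, b):
    # z-component of (a - o) x (b - o); > 0 means o -> a -> b is a left turn
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain: hull vertices of (x, y) points, positively oriented."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_fill(mask):
    """Hypothetical gap filler: replace a binary mask by its filled convex hull."""
    ys, xs = np.nonzero(mask)
    hull = convex_hull(list(zip(xs.tolist(), ys.tolist())))
    if len(hull) < 3:  # empty or degenerate (collinear) mask: nothing to fill
        return mask.astype(bool)
    h, w = mask.shape
    gy, gx = np.mgrid[0:h, 0:w]
    filled = np.ones((h, w), dtype=bool)
    for (x0, y0), (x1, y1) in zip(hull, hull[1:] + hull[:1]):
        # keep pixel centres on or to the inner side of every hull edge
        filled &= (x1 - x0) * (gy - y0) - (y1 - y0) * (gx - x0) >= 0
    return filled


def dice(pred, target):
    """Dice coefficient 2|A & B| / (|A| + |B|) between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, target).sum() / denom


# Toy example: a square region whose predicted mask has a notch cut into its top edge.
gt = np.zeros((10, 10), dtype=bool)
gt[2:8, 2:8] = True
pred = gt.copy()
pred[2:4, 4:6] = False
fixed = hull_fill(pred)
print(round(dice(pred, gt), 3), dice(fixed, gt))
```

In the toy example the notch removes 4 of 36 pixels, so Dice against the uncut square is 64/68 ≈ 0.941 before filling and 1.0 after. A convex hull also erases genuine concavities, so this kind of filling is a coarse tool that suits roughly convex layout regions better than irregular shapes.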

Abstract

This paper focuses on enhancing Bengali Document Layout Analysis (DLA) using the YOLOv8 model and innovative post-processing techniques. We tackle challenges unique to the complex Bengali script and employ data augmentation to make the model more robust. After careful evaluation on the validation set, we fine-tune our approach on the complete dataset and arrive at a two-stage prediction strategy for accurate element segmentation. Our ensemble model, combined with post-processing, outperforms the individual base architectures and addresses issues identified in the BaDLAD dataset. With this approach, we aim to advance Bengali document analysis and contribute to improved OCR and document comprehension; BaDLAD serves as a foundational resource for this effort and for future research in the field. Our experiments also yielded key insights for incorporating new strategies into the established solution.
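The TL;DR describes the ensemble as a general model covering all classes plus a specialized image model. One simple way to combine such a pair is a class-wise merge, sketched below. The routing rule (image instances come only from the specialized model, all other classes only from the general model) is an assumption for illustration, not the authors' documented procedure, and `Instance`, `merge_predictions`, and the `mask_id` placeholder are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    label: str    # layout class, e.g. "paragraph", "text_box", "image", "table"
    score: float  # detector confidence
    mask_id: str  # placeholder standing in for the instance's binary mask


def merge_predictions(general: List[Instance], image_model: List[Instance],
                      image_label: str = "image") -> List[Instance]:
    """Class-wise ensemble (assumed rule): the specialized model supplies the
    image class and the general model supplies every other class."""
    merged = [inst for inst in general if inst.label != image_label]
    # Defensive filter: ignore any non-image labels the image model might emit.
    merged += [inst for inst in image_model if inst.label == image_label]
    return merged


general = [Instance("paragraph", 0.91, "g0"), Instance("image", 0.55, "g1"),
           Instance("table", 0.80, "g2")]
image_model = [Instance("image", 0.95, "s0"), Instance("paragraph", 0.30, "s1")]
print([inst.mask_id for inst in merge_predictions(general, image_model)])
```

Here the general model's image detection (`g1`) is replaced by the specialized model's (`s0`), and the stray paragraph from the image model is ignored. A real pipeline might instead fuse overlapping masks or compare confidences; the paper's exact merging rule may differ.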

Paper Structure (19 sections, 2 figures, 2 tables)