SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia

Pengfei Yue, Xingran Zhao, Juntao Chen, Peng Hou, Wang Longchao, Jianghang Lin, Shengchuan Zhang, Anxiang Zeng, Liujuan Cao

Abstract

Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.

Paper Structure (45 sections, 1 equation, 10 figures, 15 tables, 1 algorithm)

Figures (10)

  • Figure 1: Performance of representative models on SEA-Vision. (a) End-to-end text recognition performance for document parsing across 11 languages. (b) TEC-VQA accuracy by language and model, along with overall averages (Avg.).
  • Figure 2: SEA-Vision benchmark overview. Geographical language coverage and dataset scale (left), Document Parsing types and sample page attributes (middle), and TEC-VQA examples across consumer places and public spaces (right).
  • Figure 3: Overview of the data annotation pipelines. (a) Document Parsing Annotation Pipeline: Internet-sourced document pages are first collected using domain-specific keywords and filtered for quality. Metadata annotation includes layout detection and MLLM-based analysis for language and page type identification. Candidate pages are ranked by a rule-based scoring function considering block count, type diversity, text area ratio, and presence of figures or tables. Selected samples undergo region-level correction via specialized models for text, formulas, and tables, followed by final human verification. (b) TEC-VQA Annotation Pipeline: Scene images from diverse environments (e.g., public spaces, consumer places, documents) are gathered and filtered. Layout and text are detected and re-rendered with multilingual content. An MLLM first generates English QA pairs; the English questions are then translated into Chinese to obtain Chinese QA, which is aligned with the English QA for consistency. The resulting bilingual QA pairs are translated into the image language and manually verified.
  • Figure A1: Heatmap of the TEC-VQA capability category co-occurrence matrix. Color intensity indicates the frequency of QA pairs annotated with each capability combination.
  • Figure A2: Example prompt used for Multimodal Large Language Model (MLLM) TEC-VQA baselines. The document image and question are replaced with actual samples at inference time. (Pseudo example; not from the released dataset.)
  • ...and 5 more figures
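
The caption of Figure 3(a) mentions a rule-based scoring function that ranks candidate pages by block count, type diversity, text area ratio, and the presence of figures or tables. The listing here does not give the actual formula, so the following is only a minimal sketch of how such a score could be assembled. The names PageStats, page_score, and rank_pages, the weights, and the normalization caps are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    """Layout statistics for one candidate page (hypothetical structure)."""
    block_count: int              # number of detected layout blocks
    block_types: frozenset        # distinct block types, e.g. {"text", "table"}
    text_area_ratio: float        # fraction of the page area covered by text
    has_figure_or_table: bool     # whether any figure or table was detected


def page_score(page, max_blocks=30, max_types=6, weights=(0.3, 0.3, 0.2, 0.2)):
    """Combine the four signals into one score in [0, 1].

    The caps and weights are illustrative assumptions, not the paper's values.
    """
    w_blocks, w_types, w_text, w_visual = weights
    # Normalize each signal to [0, 1] so the weights are comparable.
    blocks = min(page.block_count, max_blocks) / max_blocks
    types = min(len(page.block_types), max_types) / max_types
    text = min(max(page.text_area_ratio, 0.0), 1.0)
    visual = 1.0 if page.has_figure_or_table else 0.0
    return w_blocks * blocks + w_types * types + w_text * text + w_visual * visual


def rank_pages(pages):
    """Return pages sorted from highest to lowest score."""
    return sorted(pages, key=page_score, reverse=True)


rich = PageStats(15, frozenset({"text", "table", "title"}), 0.5, True)
plain = PageStats(3, frozenset({"text"}), 0.9, False)
ranked = rank_pages([plain, rich])
```

Under these assumed weights, a page with mixed block types and a table ranks above a text-only page even when the latter has a higher text area ratio, which matches the caption's apparent preference for structurally diverse pages.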