Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models

Zhanglin Wu, Tengfei Song, Ning Xie, Mengli Zhu, Weidong Zhang, Shuang Wu, Pengfei Li, Chong Li, Junhao Zhu, Hao Yang, Shiliang Sun

TL;DR

This work introduces MOTBench, a benchmark for evaluating large vision-language models (LVLMs) on long-text, complex-layout menu OCR and translation in both English and Chinese. It uses a fine-grained, item-by-item evaluation framework that scores OCR with $acc_1$ and $acc_2$ and translation with $BLEU$ and $COMET$, grounded in professional annotations of 65 English and 65 Chinese menus (2,328 Chinese items and 1,467 English items). Experiments across 17 open-source and 7 closed-source LVLMs reveal wide performance gaps: closed-source models generally outperform open-source ones, and results differ notably between English and Chinese on both OCR and translation. A consistency analysis demonstrates strong alignment between automatic MOTBench evaluations and human judgments, validating the framework's reliability and its potential to guide LVLM development for real-world document understanding.

Abstract

The rapid advancement of large vision-language models (LVLMs) has significantly propelled applications in document understanding, particularly in optical character recognition (OCR) and multilingual translation. However, current evaluations of LVLMs, such as the widely used OCRBench, mainly verify the correctness of short-text responses and of long-text responses with simple layouts. Evaluating the ability to understand long texts with complex layouts is highly significant but largely overlooked. In this paper, we propose the Menu OCR and Translation Benchmark (MOTBench), a specialized evaluation framework emphasizing the pivotal role of menu translation in cross-cultural communication. MOTBench requires LVLMs to accurately recognize and translate each dish on a menu, along with its price and unit, providing a comprehensive assessment of their visual understanding and language processing capabilities. Our benchmark comprises a collection of Chinese and English menus, characterized by intricate layouts, a variety of fonts, and culturally specific elements across different languages, along with precise human annotations. Experiments show that our automatic evaluation results are highly consistent with professional human evaluation. We evaluate a range of publicly available state-of-the-art LVLMs and analyze their outputs to identify the strengths and weaknesses in their performance, offering valuable insights to guide future advancements in LVLM development. MOTBench is available at https://github.com/gitwzl/MOTBench.

Paper Structure

This paper contains 21 sections, 5 figures, 5 tables, 2 algorithms.

Figures (5)

  • Figure 1: An illustration comparing existing OCR benchmarks with our MOTBench.
  • Figure 2: Examples sampled from MOTBench are shown. The menu images are categorized into four groups: 'Simple Layout,' 'Text-Image Mixed,' 'Irregular Font,' and 'Real-World'. Both English and Chinese menus are included. For each menu image, the ground truth annotations include all dish items, their prices, and their corresponding translations. We employ two distinct prompts to evaluate OCR capability and translation capability separately.
  • Figure 3: An illustration of the one-by-one comparison strategy for menu evaluation in our MOTBench.
  • Figure 4: With human evaluation serving as the ground truth, the accuracy of both rule-based and LLM-based automatic evaluation results for the menu OCR task is measured. Here, accuracy is defined as the consistency between the automatic evaluation and human evaluation.
  • Figure 5: With human evaluation serving as the ground truth, the accuracy of both rule-based and LLM-based methods for automatically extracting dish translations in the menu translation task is measured. Here, accuracy is defined as the consistency between the automatic extraction and human evaluation.
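
The captions of Figures 4 and 5 define accuracy as the consistency between an automatic method and human evaluation. A minimal sketch of one way to compute such an agreement rate is given below. It assumes each item receives a binary correct/incorrect verdict from both sources; the function name `agreement_accuracy` and the example verdicts are hypothetical and are not taken from the MOTBench code.

```python
from typing import Sequence


def agreement_accuracy(auto_verdicts: Sequence[bool],
                       human_verdicts: Sequence[bool]) -> float:
    """Fraction of items on which the automatic verdict equals the human one.

    Assumption: both sequences are aligned item by item, and human verdicts
    are treated as the ground truth.
    """
    if len(auto_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if len(human_verdicts) == 0:
        raise ValueError("at least one item is required")
    # Count the items where the two evaluations agree.
    matches = sum(a == h for a, h in zip(auto_verdicts, human_verdicts))
    return matches / len(human_verdicts)


# Hypothetical verdicts for four menu items (illustrative only).
auto = [True, False, True, True]
human = [True, True, True, False]
score = agreement_accuracy(auto, human)
```

In this sketch an item counts toward the score whenever the two verdicts match, including the case where both mark the item as incorrect, so the value measures agreement rather than model quality.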