Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese

Yuichi Inoue, Kento Sasaki, Yuma Ochi, Kazuki Fujii, Kotaro Tanahashi, Yu Yamaguchi

TL;DR

The paper addresses the lack of Japanese-focused evaluation benchmarks for Vision Language Models by introducing Japanese Heron-Bench, a dataset of 21 images with 102 Japanese-context questions across seven subcategories, and a baseline Japanese VLM trained via visual instruction tuning. Evaluation leverages GPT-4 as an oracle to score both VLM and reference answers, enabling cross-model comparisons against strong closed models like GPT-4V and open VLMs. Key contributions include publicly releasing the benchmark dataset, training code, and a competitive Japanese VLM (Heron GIT), along with a thorough analysis of strengths and gaps in Japanese VLM capabilities. The work advances culturally aware evaluation of multimodal models and guides future development toward more capable and contextually aligned Japanese VLMs.

Abstract

Vision Language Models (VLMs) have undergone a rapid evolution, giving rise to significant advancements in the realm of multimodal understanding tasks. However, the majority of these models are trained and evaluated on English-centric datasets, leaving a gap in the development and evaluation of VLMs for other languages, such as Japanese. This gap can be attributed to the lack of methodologies for constructing VLMs and the absence of benchmarks to accurately measure their performance. To address this issue, we introduce a novel benchmark, Japanese Heron-Bench, for evaluating Japanese capabilities of VLMs. The Japanese Heron-Bench consists of a variety of image-question-answer pairs tailored to the Japanese context. Additionally, we present a baseline Japanese VLM that has been trained with Japanese visual instruction tuning datasets. Our Heron-Bench reveals the strengths and limitations of the proposed VLM across various ability dimensions. Furthermore, we clarify the capability gap between strong closed models like GPT-4V and the baseline model, providing valuable insights for future research in this domain. We release the benchmark dataset and training code to facilitate further developments in Japanese VLM research.

Paper Structure (19 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Comparison of evaluation results using the Japanese-translated LLaVA Bench (In the Wild) and the Japanese Heron-Bench.
  • Figure 2: Overview of collected images for evaluation per subcategory and scoring process. The dataset consists of seven categories relevant to the Japanese context. The GPT-4 API is used to evaluate and score the answers provided by both the VLMs and GPT-4. The context works as a reference for scoring the answers.
  • Figure 3: Comparison of scores of GPT-4V (closed model), Heron GIT (Japanese VLM), and LLaVA-1.6 (English VLM) across subcategories. Box plots display raw scores of each model.
  • Figure 4: Comparison of scores for representative questions in each category. Raw scores of GPT-4V, Heron GIT, and LLaVA-1.6 for three representative questions from each category are shown.
  • Figure 5: Variability in scores across five GPT-4 API calls for each model. Bars represent average scores. Individual scores are also shown. Multiple evaluations can provide more accurate results when average scores between models are close.
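
The Figure 5 caption notes that averaging over several GPT-4 API calls gives more reliable comparisons when models score close to one another. A minimal sketch of that idea follows. It is not the authors' evaluation code: the function names, the model labels, the scores (placeholders, not results from the paper), and the "gap larger than the combined spread" rule are all assumptions made for illustration.

```python
from statistics import mean, stdev


def summarize_runs(scores_by_model):
    """Return {model: (mean, sample stdev)} over repeated judge runs."""
    summary = {}
    for model, scores in scores_by_model.items():
        # A single run has no measurable spread, so report 0.0 in that case.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[model] = (mean(scores), spread)
    return summary


def is_gap_clear(summary, model_a, model_b):
    """Heuristic (an assumption, not from the paper): treat two models as
    clearly separated only when the gap between their mean scores exceeds
    the sum of their run-to-run standard deviations."""
    mean_a, sd_a = summary[model_a]
    mean_b, sd_b = summary[model_b]
    return abs(mean_a - mean_b) > sd_a + sd_b


# Placeholder scores for five judge runs per model (NOT values from the paper).
runs = {
    "model_a": [70.0, 72.0, 71.0, 69.0, 73.0],
    "model_b": [70.5, 71.5, 69.5, 72.5, 71.0],
    "model_c": [40.0, 42.0, 41.0, 39.0, 43.0],
}

summary = summarize_runs(runs)
for name, (avg, sd) in summary.items():
    print(f"{name}: mean={avg:.2f}, stdev={sd:.2f}")

print("a vs b clearly separated:", is_gap_clear(summary, "model_a", "model_b"))
print("a vs c clearly separated:", is_gap_clear(summary, "model_a", "model_c"))
```

With these placeholder values, the first two models have overlapping run-to-run ranges, so a single judge call could rank them either way, while the third is separated by far more than the spread.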