KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language

Yoonshik Kim, Jaeyoon Jung

TL;DR

KOFFVQA addresses two gaps in vision-language model (VLM) evaluation: unreliable judgments on open-ended responses and the lack of Korean-language benchmarks. It is a general-purpose free-form VQA benchmark in which pre-defined, objective grading criteria are given to an LLM judge to score model responses, enabling reliable evaluation even with small open-source judges. The dataset comprises 275 Korean image-question pairs across 10 subcategories. An evaluation of 47 VLMs shows that larger models do not necessarily outperform smaller ones and that the grading-criteria approach yields higher consistency than baseline evaluation methods. The study also shows that giving the judge the image as input can induce hallucinations, which favors language-only judging, and it points to avenues for improving judge training and benchmark design.

Abstract

The recent emergence of Large Vision-Language Models (VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods suffer from the fact that they either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions, each paired with an image and grading criteria, covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at https://github.com/maum-ai/KOFFVQA
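To make the criteria-based grading idea concrete, the following is a minimal sketch of how a rubric with partial points might be turned into a text-only judge prompt and how the judge's reply might be parsed. It is not the code from the KOFFVQA repository: the `Criterion` class, the function names, the English prompt wording, and the `Score: N` reply format are all assumptions made for illustration, and the sample rubric is invented. The call to an actual judge model is deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not the benchmark's schema)."""
    description: str  # what the response must state to earn the points
    points: int       # partial credit awarded when the item is satisfied


def max_score(criteria: List[Criterion]) -> int:
    """Highest score a response can receive under this rubric."""
    return sum(c.points for c in criteria)


def build_judge_prompt(question: str, criteria: List[Criterion], response: str) -> str:
    """Assemble a text-only prompt; the image is intentionally not included."""
    rubric = "\n".join(f"- {c.description} (+{c.points} points)" for c in criteria)
    return (
        "Grade the response using ONLY the criteria below.\n"
        f"Question: {question}\n"
        f"Grading criteria:\n{rubric}\n"
        f"Response: {response}\n"
        f"Reply with 'Score: N' where N is an integer from 0 to {max_score(criteria)}."
    )


def parse_score(judge_output: str, upper: int) -> Optional[int]:
    """Extract the score from the judge's reply; None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= upper else None


# Invented example rubric, for illustration only.
criteria = [
    Criterion("Identifies the object as a bicycle", 6),
    Criterion("States that the bicycle is leaning against a wall", 4),
]
prompt = build_judge_prompt("What is shown in the image?", criteria, "A bicycle.")
score = parse_score("Score: 6", max_score(criteria))
```

Because the judge only checks a response against fixed rules, the prompt carries no image, which matches the benchmark's observation that image input can lead the judge to hallucinate. Returning `None` for an unparseable or out-of-range reply lets the caller retry rather than silently record a wrong score.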

Paper Structure

This paper contains 15 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Distribution of question categories and subcategories in the KOFFVQA benchmark.
  • Figure 2: Three examples from each main category of our benchmark. The left column is the original text in Korean, and the right column provides the English translation. Grading criteria paired with partial points are given to the judge model to evaluate the VLM's response.
  • Figure 3: An example of a response that GPT-4o grades correctly when the image is not given as input but grades incorrectly when the image is given. The left columns are the original text in Korean, and the right columns provide the English translations. When the image is given, the judge model attempts to judge the response based on the image and hallucinates that the door in the middle of the photograph is green. When the image is not given, the judge has no reason to grade the response based on anything other than the given criteria.