Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices

Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Mizanur Rahman, Amran Bhuiyan, Israt Jahan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang

TL;DR

The paper tackles scalable evaluation of chart understanding with LVLM judges in resource-constrained settings. It introduces two cost-efficient strategies to enable practical deployment: multi-criteria prompting, and domain-adaptive transfer learning to train a tiny evaluator (ChartJudge-2B). ChartJudge-2B demonstrates cross-dataset transfer and often outperforms larger LVLM judges in single- and multi-criteria settings, while multi-criteria prompting exposes robustness gaps in larger models. The work provides actionable deployment guidance and releases code and data to support low-cost, real-world chart reasoning evaluation.

Abstract

Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments from a chart dataset to create ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, leading to a large drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another, making it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks.
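
To make the idea of multi-criteria prompting concrete, the following is a minimal sketch of how separate per-criterion judge queries might be folded into one query. The prompt wording, the function names, and the JSON reply shape are illustrative assumptions, not the paper's actual prompts; only the criteria names and the 1-to-5 scale are taken from the figure captions.

```python
# Hypothetical prompt builders; wording is an assumption, not the paper's prompt.
CRITERIA = ["informativeness", "factual correctness"]


def single_criterion_prompts(query, answer, criteria):
    """Build one judge query per criterion (one model call each)."""
    return [
        f"Question about the chart: {query}\n"
        f"Model answer: {answer}\n"
        f"Rate the answer from 1 to 5 on {c}. "
        f'Reply as JSON: {{"{c}": <rating>}}'
        for c in criteria
    ]


def multi_criteria_prompt(query, answer, criteria):
    """Fold all criteria into a single judge query (one model call total)."""
    keys = ", ".join(f'"{c}": <rating>' for c in criteria)
    return (
        f"Question about the chart: {query}\n"
        f"Model answer: {answer}\n"
        f"Rate the answer from 1 to 5 on each of: {', '.join(criteria)}. "
        f"Reply as JSON: {{{keys}}}"
    )


singles = single_criterion_prompts("What is the trend?", "Sales rise.", CRITERIA)
merged = multi_criteria_prompt("What is the trend?", "Sales rise.", CRITERIA)
```

With two criteria, the single-criterion route issues two judge calls while the merged prompt issues one, which is where the cost saving comes from; the abstract notes that this merged form is also where 7B judges lose robustness.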

Paper Structure

This paper contains 38 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: GPT-4o as the judge evaluating a Claude-3-Haiku answer in the OpenCQA dataset across multiple criteria (informativeness, factual correctness).
  • Figure 2: Our Fine-Tuning Approach. (a) At first, responses generated by various LVLMs (e.g., GPT-4, Phi-3) for chart captioning (e.g., Chart-to-Text dataset) are judged on diverse criteria by a different LVLM (e.g., Gemini-1.5-Pro). A small LVLM (e.g., Qwen2-VL-2B-Instruct) is then fine-tuned on these judgments to create ChartJudge-2B. (b) For evaluation, responses from LVLMs (e.g., Claude, Gemini) on chart benchmarks are judged by ChartJudge-2B and compared with a larger LVLM (e.g., GPT-4o) or human ratings.
  • Figure 3: Ablation results for the ChartJudge-2B model in OpenCQA via training (i) without query, (ii) only pointwise samples, (iii) only pairwise samples, and (iv) merged version of only pairwise and pointwise models.
  • Figure 4: An example of an error case in which the PaliGemma-3b model is tasked with evaluating a chart caption generated by another model. Specifically, it was asked to rate the caption on a scale of 1 to 5 based on the 'Informativeness' criterion and to provide an explanation for the rating. However, instead of performing the evaluation correctly, the model hallucinated and repeatedly generated the same line without adhering to the required JSON format (highlighted in red text).
  • Figure 5: An example of an error case for the LLaVA-Critic-7B model, which demonstrates position bias by changing its selection of the better caption when the order of the model-generated captions is changed.
  • ...and 1 more figures
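
The position bias described in the Figure 5 caption can be probed with a simple swap test: ask the pairwise judge twice with the caption order reversed and check that the same caption wins both times. The sketch below assumes a hypothetical `judge(caption_a, caption_b)` callable that returns `"A"` or `"B"`; the two toy judges are stand-ins for illustration, not real models.

```python
from typing import Callable


def is_position_consistent(
    judge: Callable[[str, str], str], cap_1: str, cap_2: str
) -> bool:
    """Return True if the judge picks the same caption in both orderings."""
    first = judge(cap_1, cap_2)   # cap_1 is shown in slot "A"
    second = judge(cap_2, cap_1)  # cap_1 is shown in slot "B"
    winner_first = cap_1 if first == "A" else cap_2
    winner_second = cap_2 if second == "A" else cap_1
    return winner_first == winner_second


def always_first(caption_a: str, caption_b: str) -> str:
    """Toy judge with maximal position bias: always prefers slot A."""
    return "A"


def prefers_longer(caption_a: str, caption_b: str) -> str:
    """Toy judge that ignores position and prefers the longer caption."""
    return "A" if len(caption_a) >= len(caption_b) else "B"


biased = is_position_consistent(always_first, "short", "a longer caption")
unbiased = is_position_consistent(prefers_longer, "short", "a longer caption")
```

Here `biased` evaluates to `False` and `unbiased` to `True`; the fraction of consistent pairs over a dataset gives one rough measure of how position-sensitive a judge is.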