Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices

Md Tahmid Rahman Laskar; Mohammed Saidul Islam; Ridwan Mahbub; Mizanur Rahman; Amran Bhuiyan; Israt Jahan; Mir Tafseer Nayeem; Shafiq Joty; Enamul Hoque; Jimmy Huang

TL;DR

The paper tackles scalable evaluation of chart understanding with LVLM judges in resource-constrained settings. It introduces two cost-efficient strategies to enable practical deployment: multi-criteria prompting, and domain-adaptive transfer learning to train a tiny evaluator (ChartJudge-2B). ChartJudge-2B demonstrates cross-dataset transfer and often outperforms larger LVLM judges in single- and multi-criteria settings, while multi-criteria prompting exposes robustness gaps in the larger models. The work provides actionable deployment guidance and releases code and data to support low-cost, real-world evaluation of chart reasoning.

Abstract

Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (≤2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments from a chart dataset to create ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, causing a large drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another, making it a more specialized judge. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks.
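To make the idea of multi-criteria prompting concrete, the following is a minimal sketch of how separate evaluation criteria might be folded into a single judge query. The criterion names, prompt wording, 1-to-5 scale, and JSON reply format are illustrative assumptions, not the paper's actual rubric or prompts, and no model is called here.

```python
import json

# Hypothetical criteria for illustration only; the paper's rubric may differ.
CRITERIA = {
    "factual_correctness": "Is the answer consistent with the data shown in the chart?",
    "informativeness": "Does the answer cover the information the question asks for?",
    "relevance": "Does the answer stay on the topic of the question?",
}


def single_criterion_prompts(question, answer, criteria=CRITERIA):
    """Build one judge query per criterion (one model call each)."""
    return [
        f"Question: {question}\nAnswer: {answer}\n"
        f"Criterion ({name}): {desc}\n"
        "Reply with a rating from 1 to 5."
        for name, desc in criteria.items()
    ]


def multi_criteria_prompt(question, answer, criteria=CRITERIA):
    """Combine all criteria into a single judge query (one model call total)."""
    lines = [f"- {name}: {desc}" for name, desc in criteria.items()]
    keys = ", ".join(f'"{name}"' for name in criteria)
    return (
        f"Question: {question}\nAnswer: {answer}\n"
        "Rate the answer from 1 to 5 on each criterion:\n"
        + "\n".join(lines)
        + f"\nReply with a JSON object with keys {keys}."
    )


def parse_multi_criteria_reply(reply, criteria=CRITERIA):
    """Parse the judge's JSON reply; return None if it is malformed or incomplete."""
    try:
        scores = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(scores, dict) or set(scores) != set(criteria):
        return None
    return scores
```

With three criteria, the combined form needs one judge call per example instead of three, which is where the cost saving comes from. The trade-off is that the judge must follow a stricter output format, so a parser that rejects malformed replies is useful for measuring how often a given model fails to comply.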
