TabReX: Tabular Referenceless eXplainable Evaluation
Tejas Anvekar, Juhna Park, Aparna Garimella, Vivek Gupta
TL;DR
TabReX tackles the challenge of evaluating LLM-generated tabular outputs by introducing a reference-less, graph-based framework that converts text and tables into knowledge graphs, aligns them with an LLM-guided process, and produces rubric-aware scores that quantify structural and factual fidelity. The accompanying TabReX-Bench provides a large, perturbation-driven benchmark across six domains to stress-test metric robustness and human alignment. Empirical results show that TabReX achieves strong correlations with expert judgments, remains stable under hard perturbations, and offers fine-grained, explainable diagnostics at the cell and table levels, enabling targeted model and prompt analysis. This work establishes a practical, interpretable paradigm for trustworthy evaluation of structured generation systems and points toward scalable, domain-adaptive evaluators for real-world applications.
Abstract
Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
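To make the graph-based idea concrete, the following is a minimal illustrative sketch, not the TabReX implementation. It assumes that facts from the source text are already available as (entity, attribute, value) triples, that the first table column names the row entity, and it substitutes exact matching on normalized strings for the LLM-guided alignment. The function names (`normalize`, `table_to_triples`, `score_table`) and the precision/recall-style scores are hypothetical stand-ins for the paper's rubric-aware scores.

```python
from typing import Dict, List, Set, Tuple

# A fact is modeled as (row entity, column attribute, cell value).
Triple = Tuple[str, str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.strip().lower().split())


def table_to_triples(header: List[str], rows: List[List[str]]) -> Set[Triple]:
    """Turn a table into triples, assuming the first column holds the row entity."""
    triples: Set[Triple] = set()
    for row in rows:
        entity = normalize(row[0])
        for column, cell in zip(header[1:], row[1:]):
            triples.add((entity, normalize(column), normalize(cell)))
    return triples


def score_table(source: Set[Triple], table: Set[Triple]) -> Dict[str, object]:
    """Compare source and table triples by exact match.

    Exact matching is an assumption made for this sketch; it replaces the
    LLM-guided alignment described in the abstract.
    """
    matched = source & table
    return {
        # Share of table cells supported by the source.
        "precision": len(matched) / len(table) if table else 0.0,
        # Share of source facts that appear in the table.
        "recall": len(matched) / len(source) if source else 0.0,
        # Cell-level error traces.
        "unsupported_cells": sorted(table - source),
        "missing_facts": sorted(source - table),
    }


# Hypothetical source facts and generated table.
source_facts: Set[Triple] = {
    ("alice", "age", "30"),
    ("alice", "city", "paris"),
    ("bob", "age", "25"),
}
header = ["Name", "Age", "City"]
rows = [["Alice", "30", "Paris"], ["Bob", "26", "Berlin"]]

report = score_table(source_facts, table_to_triples(header, rows))
print(report)
```

In this toy case, two of the four table cells are supported by the source, and the trace lists the two cells in Bob's row as unsupported, which is the kind of cell-level diagnostic the abstract describes. No reference table is needed, because the comparison is against the source facts.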
