TabReX: Tabular Referenceless eXplainable Evaluation

Tejas Anvekar, Juhna Park, Aparna Garimella, Vivek Gupta

TL;DR

TabReX tackles the challenge of evaluating LLM-generated tabular outputs by introducing a reference-less, graph-based framework that converts text and tables into knowledge graphs, aligns them with an LLM-guided process, and produces rubric-aware scores that quantify structural and factual fidelity. The accompanying TabReX-Bench provides a large, perturbation-driven benchmark across six domains to stress-test metric robustness and human alignment. Empirical results show that TabReX achieves strong correlations with expert judgments, remains stable under hard perturbations, and offers fine-grained, explainable diagnostics at the cell and table levels, enabling targeted model and prompt analysis. This work establishes a practical, interpretable paradigm for trustworthy evaluation of structured generation systems and points toward scalable, domain-adaptive evaluators for real-world applications.

Abstract

Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
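
The pipeline described in the abstract (table to graph, alignment, penalty-style scoring) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the triple format, the function names `table_to_graph`, `align`, and `penalty`, and the two weights are hypothetical, and exact set matching stands in for the LLM-guided alignment step. The source-side triples are written by hand where the paper would use a text-to-graph conversion.

```python
# Hypothetical sketch: exact triple matching replaces LLM-guided alignment.
Triple = tuple[str, str, str]  # (row entity, column attribute, cell value)


def table_to_graph(header: list[str], rows: list[list[str]]) -> set[Triple]:
    """Turn a table into (entity, attribute, value) triples keyed by the first column."""
    triples: set[Triple] = set()
    for row in rows:
        entity = row[0].strip().lower()
        for attribute, value in zip(header[1:], row[1:]):
            triples.add((entity, attribute.strip().lower(), value.strip().lower()))
    return triples


def align(source: set[Triple], table: set[Triple]) -> dict[str, set[Triple]]:
    """Split triples into matched, missing from the table, and extra in the table."""
    return {
        "matched": source & table,
        "missing": source - table,
        "extra": table - source,
    }


def penalty(alignment: dict[str, set[Triple]],
            missing_weight: float = 1.0,
            extra_weight: float = 1.0) -> float:
    """Weighted count of unaligned triples; 0.0 means a perfect match."""
    return (missing_weight * len(alignment["missing"])
            + extra_weight * len(alignment["extra"]))


# Hand-written stand-in for the graph extracted from the source text.
source_graph = {
    ("alice", "age", "30"),
    ("alice", "city", "paris"),
    ("bob", "age", "25"),
}
table_graph = table_to_graph(
    ["Name", "Age", "City"],
    [["Alice", "30", "Paris"], ["Bob", "26", "Rome"]],
)
alignment = align(source_graph, table_graph)
print(sorted(alignment["missing"]))  # triples the table fails to support
print(sorted(alignment["extra"]))    # triples the source does not support
print(penalty(alignment))
```

The `missing` and `extra` sets play the role of a cell-level error trace, and changing the two weights shifts how harshly omissions are penalized relative to unsupported cells, which is one simple way to picture a controllable trade-off.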

Paper Structure

This paper contains 46 sections, 10 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Metric Movements Across Difficulty Levels. Arrows show each metric’s shift from easy (blue) to hard (red) perturbations. Axes plot specificity (y) vs. sensitivity (x), with the green region denoting the balanced ideal zone. The dashed diagonal marks the optimal trade-off. TabReX stays near this zone, maintaining the right direction even for hard examples.
  • Figure 2: Illustration of the proposed TabReX. Both source text and generated tables are converted into knowledge graphs via Text2Graph and Table2Graph, aligned through an LLM-guided Graph Alignment, and finally scored by a Property-Driven Scoring function that aggregates alignment statistics into interpretable, controllable table- and cell-level penalties.
  • Figure 3: Perturbation landscape across difficulty and type. The radial stacked donut visualizes the distribution of perturbation types segmented by difficulty: Easy (green), Medium (blue), and Hard (red). The top and bottom semicircles correspond to data-altering and data-preserving transformations, respectively.
  • Figure 4: Rubric-wise alignment across models and prompting strategies. Top row: cell-level agreement within each model across prompts. Bottom row: table-level agreement. Model size and reasoning style influence local precision more than structural coherence, while prompt strategy (such as Map&Make) drives balanced alignment across rubric dimensions.
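
The sensitivity and specificity axes of Figure 1 can be made concrete with a small sketch. This page does not give the paper's exact definitions, so the code below uses the standard ones under an assumption: a good metric should penalize data-altering perturbations (sensitivity) and leave data-preserving perturbations unpenalized (specificity). The function name, the threshold, and the sample numbers are hypothetical.

```python
def sensitivity_specificity(penalties: list[float],
                            alters_data: list[bool],
                            threshold: float = 0.0) -> tuple[float, float]:
    """Standard sensitivity and specificity of a penalty-style metric.

    penalties:   metric penalty assigned to each perturbed table
    alters_data: True if the perturbation changed the underlying data
    threshold:   a table counts as flagged when its penalty exceeds this value
    """
    flagged = [p > threshold for p in penalties]
    tp = sum(f and a for f, a in zip(flagged, alters_data))
    fn = sum((not f) and a for f, a in zip(flagged, alters_data))
    tn = sum((not f) and (not a) for f, a in zip(flagged, alters_data))
    fp = sum(f and (not a) for f, a in zip(flagged, alters_data))
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up penalties for five perturbed tables: three data-altering, two data-preserving.
sens, spec = sensitivity_specificity(
    penalties=[2.0, 0.0, 1.5, 0.0, 0.5],
    alters_data=[True, True, True, False, False],
)
print(sens, spec)
```

Computing the pair separately on easy and hard perturbations would give the two endpoints of one arrow in a plot like Figure 1.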