
GLEAN: Grounded Lightweight Evaluation Anchors for Contamination-Aware Tabular Reasoning

Qizhi Wang

TL;DR

GLEAN is a lightweight evaluation protocol that integrates contamination-aware probes, weak-supervision governance, retrieval-reasoning diagnostics, and structured error attribution under tight hardware constraints. It is released with audits and sensitivity checks to make small-model tabular evaluation more contamination-aware and diagnostic.

Abstract

Tabular reasoning benchmarks mix semantic inference, numerical computation, and brittle table formatting, yet evaluations for small models remain vulnerable to contamination, dataset artifacts, and retrieval failures. We propose GLEAN, a lightweight evaluation protocol that integrates contamination-aware probes, weak-supervision governance, retrieval-reasoning diagnostics, and structured error attribution under tight hardware constraints. We evaluate across TabFact, WTQ via Squall, TableBench, RobuT, and SciTab under a 16GB GPU budget. Using Squall gold SQL as an executable anchor (95.2% execution), GLEAN assigns a deterministic error taxonomy (L0-L4 plus L0.5 context miss) and reveals a stable error-mode separation: TAPEX errors skew toward grounding (L3) while TAPAS errors skew toward hallucination/abstention (L2/L0). We validate evidence-row heuristics against SQL-derived rows on simple queries (0.62 precision / 0.71 recall; hybrid recall 0.81) and show that retrieval Recall@K can saturate even when end-to-end EM/F1 remains limited, motivating attribution beyond raw recall. We release a modular framework with audits and sensitivity checks to make small-model tabular evaluation more contamination-aware and diagnostic.
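The row-level metrics named in the abstract can be illustrated with a minimal sketch. This is not the GLEAN implementation: the function names, the representation of evidence as row indices, and the toy data are all assumptions made for illustration. It shows how precision and recall of heuristic evidence rows against SQL-derived rows might be computed, and why Recall@K can reach 1.0 while exact match on the final answer still fails.

```python
from typing import Iterable, Sequence, Tuple


def row_precision_recall(predicted: Iterable[int], gold: Iterable[int]) -> Tuple[float, float]:
    """Row-level precision and recall of predicted evidence rows against gold rows."""
    pred_set, gold_set = set(predicted), set(gold)
    hits = len(pred_set & gold_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    return precision, recall


def recall_at_k(ranked_rows: Sequence[int], gold: Iterable[int], k: int) -> float:
    """Fraction of gold evidence rows that appear among the top-k retrieved rows."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(set(ranked_rows[:k]) & gold_set) / len(gold_set)


def exact_match(prediction: str, answer: str) -> bool:
    """Case- and whitespace-insensitive exact match (a simplified, assumed EM definition)."""
    return prediction.strip().lower() == answer.strip().lower()


# Hypothetical toy example: gold evidence rows (e.g. derived from executing gold SQL).
gold_rows = [1, 2]
heuristic_rows = [1, 4, 7]   # rows chosen by an evidence-row heuristic
ranked_rows = [4, 1, 7, 2]   # rows in retrieval order

precision, recall = row_precision_recall(heuristic_rows, gold_rows)
r_at_3 = recall_at_k(ranked_rows, gold_rows, k=3)
r_at_4 = recall_at_k(ranked_rows, gold_rows, k=4)

# Retrieval is saturated at k=4, yet the model's answer can still be wrong.
em = exact_match("1998", "1999")
```

In this toy case all gold rows are retrieved at k=4, but the end-to-end answer is still incorrect, which is the kind of gap that an error attribution step beyond raw recall is meant to explain.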

Paper Structure (40 sections, 7 figures, 14 tables)


Figures (7)

  • Figure 1: GLEAN protocol pipeline: retrieval/pruning, model inference, and structured evaluation.
  • Figure 2: Squall QA performance on full 11,276.
  • Figure 3: SQL-anchored attribution distribution on Squall (full 11,276).
  • Figure 4: Key results summarized as bar charts.
  • Figure 5: WTQ-1k trade-off at 1024 tokens: Recall@10 (answer evidence) vs. TAPEX EM (matched subset).
  • ...and 2 more figures