RADAR: Benchmarking Language Models on Imperfect Tabular Data

Ken Gu, Zhihan Zhang, Kate Lin, Yuwei Zhang, Akshay Paruchuri, Hong Yu, Mehran Kazemi, Kumar Ayush, A. Ali Heydari, Maxwell A. Xu, Girish Narayanswamy, Yun Liu, Ming-Zher Poh, Yuzhe Yang, Mark Malhotra, Shwetak Patel, Hamid Palangi, Xuhai Xu, Daniel McDuff, Tim Althoff, Xin Liu

TL;DR

RADAR addresses a gap in how language models are evaluated on data-aware reasoning over imperfect tabular data. It provides a scalable framework for simulating data artifacts and a curated benchmark of 2980 table-query pairs spanning 9 domains, 5 data artifact types, and multiple table sizes. Empirical evaluation shows that frontier models degrade significantly when data artifacts are present, even with code-based tool usage, underscoring the need for more robust, data-aware reasoning. The benchmark serves as a resource for diagnosing weaknesses and guiding the development of more reliable tabular reasoning agents.

Abstract

Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies -- remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework that simulates data artifacts via programmatic perturbations, enabling targeted evaluation of model behavior. RADAR comprises 2980 table-query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds up as tables grow. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
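
As an illustration of what a programmatic perturbation with controllable table size might look like, here is a minimal sketch in plain Python. It is not the RADAR implementation: the function names `inject_missing` and `subsample_rows`, the list-of-dicts table representation, and the toy fare/miles columns are all assumptions made for this example.

```python
import random
from typing import Optional

# Hypothetical table representation: one dict per row.
Row = dict[str, Optional[float]]


def inject_missing(table: list[Row], column: str, n_cells: int, seed: int = 0) -> list[Row]:
    """Return a copy of `table` with `n_cells` values in `column` set to None."""
    rng = random.Random(seed)
    hit = set(rng.sample(range(len(table)), n_cells))
    return [
        {**row, column: None} if i in hit else dict(row)
        for i, row in enumerate(table)
    ]


def subsample_rows(table: list[Row], n_rows: int, seed: int = 0) -> list[Row]:
    """Return a smaller copy of `table` with `n_rows` rows, keeping row order."""
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(table)), n_rows))
    return [dict(table[i]) for i in keep]


# Toy clean table (made-up values, not benchmark data).
clean = [{"fare": 10.0 + i, "miles": 2.0 + i} for i in range(8)]

perturbed = inject_missing(clean, "fare", n_cells=2)  # missing-value artifact
small = subsample_rows(clean, n_rows=4)               # smaller table variant
```

Fixing the seed keeps each perturbation reproducible, so the same artifact can be applied at different table sizes and the results compared.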

Paper Structure

This paper contains 20 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Reasoning over tabular data containing data artifacts and corresponding performance of language models.
  • Figure 2: Overview of RADAR. Expert-written programmatic functions are used to: (1) generate ground truth answers (via answer functions invariant to table dimensions), and (2) simulate data artifacts by producing perturbed and recovered versions of the original table. We evaluate LMs on perturbed tables by computing the ground-truth answer over the corresponding recovered table, enabling a controlled and consistent evaluation across data artifact types and varying table sizes.
  • Figure 3: Data Artifact Types. Given a table $T$ without artifacts and a query $Q$ (e.g., "What is the average fare per mile?"), we perturb tables to simulate different data artifacts.
  • Figure 4: Data Statistics of RADAR.
  • Figure 5: Frontier models struggle with logically inconsistent tables, despite clean-table success. Exact match scores on logically inconsistent tables on tasks where the model answered correctly on the clean table (indicated by N).
  • ...and 3 more figures
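
The evaluation flow described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a hedged illustration, not the benchmark's code: the helper names (`average_fare_per_mile`, `exact_match`), the choice to interpret "average fare per mile" as the mean of per-row ratios, the toy values, and the assumption that the recovered table is obtained by dropping the corrupted row are all made up for the example.

```python
from typing import Optional

# Hypothetical table representation: one dict per row.
Row = dict[str, Optional[float]]


def average_fare_per_mile(table: list[Row]) -> float:
    """Answer function: works for any number of rows (assumes mean of per-row ratios)."""
    ratios = [row["fare"] / row["miles"] for row in table]
    return round(sum(ratios) / len(ratios), 2)


def exact_match(prediction: float, truth: float, tol: float = 1e-6) -> bool:
    """Score a numeric prediction against the ground truth."""
    return abs(prediction - truth) <= tol


# Toy clean table T (made-up values).
clean = [
    {"fare": 10.0, "miles": 4.0},
    {"fare": 12.0, "miles": 6.0},
    {"fare": 9.0, "miles": 3.0},
    {"fare": 20.0, "miles": 8.0},
]

# Perturbed table: one fare is replaced by an implausible outlier.
perturbed = [dict(row) for row in clean]
perturbed[1]["fare"] = 12000.0

# Recovered table: here assumed to be the perturbed table minus the bad row.
recovered = [row for i, row in enumerate(perturbed) if i != 1]

# The ground truth is computed on the recovered table, not the clean one.
ground_truth = average_fare_per_mile(recovered)

# A model that ignores the artifact effectively answers over the perturbed table.
naive_answer = average_fare_per_mile(perturbed)
naive_is_correct = exact_match(naive_answer, ground_truth)
```

Because the answer function does not depend on the number of rows, the same function can score tables of any size, which is what makes evaluation consistent across artifact types and table sizes.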