RETQA: A Large-Scale Open-Domain Tabular Question Answering Dataset for Real Estate Sector

Zhensheng Wang, Wenmian Yang, Kun Zhou, Yiquan Zhang, Weijia Jia

TL;DR

RETQA is the first large-scale open-domain Chinese tabular QA dataset for the real estate domain, containing 4,932 tables and 20,762 QA pairs across 16 intents and 6 slot types. The authors propose the SLUTQA framework, which combines SLU labels with in-context learning to improve retrieval and answer generation over long tables and multi-table queries, and report significant gains over vanilla baselines in both Markdown and SQL-style outputs across multiple LLMs. The contributions include the dataset construction (table collection, templates, and QA generation) and a retrieval-and-answer pipeline that handles open-domain, long-table QA without fine-tuning. The work offers benchmarks and methodological insights for integrating SLU with TQA, and the code and data are publicly available to advance research in open-domain tabular QA.

Abstract

The real estate market relies heavily on structured data, such as property details, market trends, and price fluctuations. However, the lack of specialized Tabular Question Answering datasets in this domain limits the development of automated question-answering systems. To fill this gap, we introduce RETQA, the first large-scale open-domain Chinese Tabular Question Answering dataset for Real Estate. RETQA comprises 4,932 tables and 20,762 question-answer pairs across 16 sub-fields within three major domains: property information, real estate company finance information, and land auction information. Compared with existing tabular question answering datasets, RETQA poses greater challenges due to three key factors: long-table structures, open-domain retrieval, and multi-domain queries. To tackle these challenges, we propose the SLUTQA framework, which integrates large language models with spoken language understanding tasks to enhance retrieval and answering accuracy. Extensive experiments demonstrate that SLUTQA significantly improves the performance of large language models on RETQA through in-context learning. RETQA and SLUTQA provide essential resources for advancing tabular question answering research in the real estate domain, addressing critical challenges in open-domain and long-table question answering. The dataset and code are publicly available at https://github.com/jensen-w/RETQA.

Paper Structure

This paper contains 25 sections, 1 equation, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Example of SLU labels and the relation to open-domain TQA, where "D" represents district, "C" represents city, "M" represents month, "Y" represents year, and "DN" represents development name.
  • Figure 2: General framework of SLUTQA.
  • Figure 3: Showcase of Template Filling.
  • Figure 4: Showcase of an Example in the Dataset.
  • Figure 5: Visualizing Distribution of Rewrite Scores.
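
To make the link between SLU labels and open-domain table retrieval (Figure 1) more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each table carries metadata keyed by the slot abbreviations from the Figure 1 caption, and that a question has already been labeled with an intent and slot values. The `TableMeta` record, the `retrieve` function, the intent names, and the toy data are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field

# Slot abbreviations as given in the Figure 1 caption.
SLOT_NAMES = {
    "D": "district",
    "C": "city",
    "M": "month",
    "Y": "year",
    "DN": "development name",
}


@dataclass
class TableMeta:
    """Hypothetical metadata record for one table in the corpus."""
    table_id: str
    intent: str
    slots: dict = field(default_factory=dict)


def retrieve(tables, intent, slots):
    """Return ids of tables whose intent matches the predicted intent
    and whose metadata agrees with every predicted slot value."""
    hits = []
    for table in tables:
        if table.intent != intent:
            continue
        if all(table.slots.get(key) == value for key, value in slots.items()):
            hits.append(table.table_id)
    return hits


# Toy corpus invented for illustration; not taken from RETQA.
tables = [
    TableMeta("t1", "property_price", {"C": "Guangzhou", "Y": "2023", "M": "5"}),
    TableMeta("t2", "property_price", {"C": "Shenzhen", "Y": "2023", "M": "5"}),
    TableMeta("t3", "land_auction", {"C": "Guangzhou", "Y": "2023"}),
]

# Assumed SLU output for a question about Guangzhou property prices in 2023.
print(retrieve(tables, "property_price", {"C": "Guangzhou", "Y": "2023"}))
```

Under these assumptions, the intent narrows the search to one sub-field and the slots narrow it to specific tables, so only a small set of candidate tables needs to be passed to the LLM for answering.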