Optimization Techniques for Unsupervised Complex Table Reasoning via Self-Training Framework

Zhenyu Li, Xiuxing Li, Sunqi Fan, Jianyong Wang

TL;DR

This work tackles the shortage of labeled data for complex tabular reasoning by proposing UCTR-ST, a unified framework that synthesizes diverse, program-driven samples and uses self-training to exploit unlabeled data. It combines three components, Program-Management, Program-Transformation, and a Table-Text Manipulator, to generate joint table-text reasoning data and to bridge program outputs to natural language, enabling reasoning over both homogeneous and heterogeneous data. Experiments on FEVEROUS, TAT-QA, WikiSQL, and SEM-TAB-FACTS show that synthetic data plus self-training can approach supervised performance and substantially boost low-resource domains. The approach also serves as a data augmentation technique for supervised models, reducing annotation costs while remaining applicable across domains.

Abstract

Structured tabular data is a fundamental data type in numerous fields, and the capacity to reason over tables is crucial for answering questions and validating hypotheses. However, constructing labeled data for complex reasoning tasks is labor-intensive, and the quantity of annotated data remains insufficient to support the intricate demands of real-world applications. To address the insufficient annotation challenge, we present a self-training framework for unsupervised complex tabular reasoning (UCTR-ST) by generating diverse synthetic data with complex logic. Specifically, UCTR-ST incorporates several essential techniques: we aggregate diverse programs and execute them on tables based on a "Program-Management" component, and we bridge the gap between programs and text with a powerful "Program-Transformation" module that generates natural language sentences with complex logic. Furthermore, we optimize the procedure using a "Table-Text Manipulator" to handle joint table-text reasoning scenarios. The entire framework utilizes self-training techniques to leverage the unlabeled training data, which results in significant performance improvements when tested on real-world data. Experimental results demonstrate that UCTR-ST achieves above 90% of the supervised model performance on different tasks and domains, reducing the dependence on manual annotation. Additionally, our approach can serve as a data augmentation technique, significantly boosting the performance of supervised models in low-resource domains.
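The self-training idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a generic teacher-student loop with a confidence filter; the toy 1-D nearest-centroid classifier, the `min_margin` threshold, and all function names are hypothetical stand-ins chosen for illustration, not the models or filtering rules used in the paper.

```python
from statistics import mean

def train(samples):
    """Fit a toy 1-D nearest-centroid classifier; returns {label: centroid}."""
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}

def predict(model, x):
    """Return (label, margin); margin is the distance gap between the two nearest centroids."""
    ranked = sorted(model.items(), key=lambda kv: abs(x - kv[1]))
    label = ranked[0][0]
    if len(ranked) == 1:
        return label, float("inf")
    margin = abs(x - ranked[1][1]) - abs(x - ranked[0][1])
    return label, margin

def self_train(synthetic, unlabeled, rounds=3, min_margin=1.0):
    """Teacher labels the unlabeled pool; a student is retrained on synthetic + pseudo-labels."""
    teacher = train(synthetic)
    for _ in range(rounds):
        pseudo = []
        for x in unlabeled:
            label, margin = predict(teacher, x)
            if margin >= min_margin:  # keep only confident pseudo-labels (assumed filter)
                pseudo.append((x, label))
        teacher = train(synthetic + pseudo)  # the student becomes the next teacher
    return teacher

# Synthetic labeled data stands in for program-generated samples.
synthetic = [(0.0, "refuted"), (1.0, "refuted"), (9.0, "supported"), (10.0, "supported")]
unlabeled = [2.0, 3.0, 5.2, 7.0, 8.0]
model = self_train(synthetic, unlabeled)
```

In this toy run the ambiguous point 5.2 never clears the margin threshold, so it is left out of every round while the confident points pull the centroids toward the unlabeled distribution.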

Paper Structure (30 sections, 9 equations, 6 figures, 8 tables, 2 algorithms)

Figures (6)

  • Figure 1: A previous study (Chemmengath et al., 2021) shows that model performance degrades dramatically on topics not seen during training.
  • Figure 2: The comparison of simple claims and complex claims. A simple claim only involves a specific table cell, but a complex claim requires the annotator to consider the relationship among multiple cells.
  • Figure 3: Illustration of our framework. The left part depicts how we generate synthetic samples (enclosed in the dashed box). Specifically, the Table-To-Text operator splits the original table into a sub-table and a generated sentence and then builds a joint table-text reasoning sample based on the basic modules. The Text-To-Table operator adopts a similar procedure but aggregates information from the original table and text to form an expanded table. The right part shows how we apply the self-training technique. In each iteration, the teacher model infers on the unlabeled data to generate pseudo-labeled data, which is then used to train a better student model.
  • Figure 4: Examples of three types of programs we used in this work: logical forms, SQL queries, and arithmetic expressions. The Program-Transformation module transforms a logical form into a claim and transforms the other two types of programs into questions.
  • Figure 5: Effectiveness of the synthetic data. The orange line corresponds to the model first trained on the synthetic data and then fine-tuned on varying numbers of labeled samples. The blue line corresponds to the model trained directly on labeled samples.
  • ...and 1 more figure
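The three program types named in the Figure 4 caption (logical forms, SQL queries, and arithmetic expressions) can each be executed on a table to obtain a label or answer. The sketch below is illustrative only: the toy table, the tuple-based logical-form operators (`eq`, `argmax`, `const`), and the helper names are assumptions made for this example and are not taken from the paper.

```python
import sqlite3

# A toy table; the contents are invented for illustration.
table = [
    {"team": "Lions", "wins": 12, "losses": 4},
    {"team": "Hawks", "wins": 9, "losses": 7},
    {"team": "Bears", "wins": 5, "losses": 11},
]

def run_logical_form(form, rows):
    """Evaluate a tiny, hypothetical logical-form language written as nested tuples."""
    op, *args = form
    if op == "eq":
        return run_logical_form(args[0], rows) == run_logical_form(args[1], rows)
    if op == "argmax":  # ("argmax", rank_column, return_column)
        best = max(rows, key=lambda r: r[args[0]])
        return best[args[1]]
    if op == "const":
        return args[0]
    raise ValueError(f"unknown operator: {op}")

def run_sql(query, rows):
    """Load the rows into an in-memory SQLite table named t and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (team TEXT, wins INTEGER, losses INTEGER)")
    conn.executemany("INSERT INTO t VALUES (:team, :wins, :losses)", rows)
    result = conn.execute(query).fetchall()
    conn.close()
    return result

def run_arithmetic(rows, team):
    """Arithmetic expression: wins / (wins + losses) for the named team."""
    row = next(r for r in rows if r["team"] == team)
    return row["wins"] / (row["wins"] + row["losses"])

# Logical form -> truth value that could label a claim such as
# "The Lions have the most wins."
claim_label = run_logical_form(
    ("eq", ("argmax", "wins", "team"), ("const", "Lions")), table
)
# SQL query -> answer to a question such as "Which teams lost more than they won?"
sql_answer = run_sql("SELECT team FROM t WHERE losses > wins", table)
# Arithmetic expression -> answer to "What is the Lions' win ratio?"
ratio = run_arithmetic(table, "Lions")
```

Executing the program gives the label or answer for free; turning the program into a fluent claim or question is the separate job the caption assigns to the Program-Transformation module.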