Table-LLM-Specialist: Language Model Specialists for Tables using Iterative Generator-Validator Fine-tuning

Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang, Surajit Chaudhuri

TL;DR

This work proposes a Generator-Validator paradigm that iteratively generates and then validates training data from language models, in order to fine-tune stronger Table-Specialist models that specialize in a given task without requiring manually labeled data.

Abstract

In this work, we propose Table-LLM-Specialist, or Table-Specialist for short, as a new self-trained fine-tuning paradigm specifically designed for table tasks. Our insight is that for each table task, there often exist two dual versions of the same task, one generative and one classification in nature. Leveraging their duality, we propose a Generator-Validator paradigm that iteratively generates and then validates training data from language models, in order to fine-tune stronger Table-Specialist models that specialize in a given task without requiring manually labeled data. Our extensive evaluations suggest that Table-Specialist has (1) strong performance on diverse table tasks over vanilla language models: for example, Table-Specialist fine-tuned on GPT-3.5 not only outperforms vanilla GPT-3.5, but can often match or surpass GPT-4 level quality; (2) lower cost to deploy, because when Table-Specialist fine-tuned on GPT-3.5 achieves GPT-4 level quality, it becomes possible to deploy smaller models with lower latency and inference cost at comparable quality; and (3) better generalizability when evaluated across multiple benchmarks, since Table-Specialist is fine-tuned on a broad range of training data systematically generated from diverse real tables. Our code and data will be available at https://github.com/microsoft/Table-LLM-Specialist.

Paper Structure

This paper contains 13 sections, 2 theorems, 14 figures, 9 tables, 3 algorithms.

Key Result

Proposition 1

Figures (14)

  • Figure 1: Performance vs. generalizability trade-offs: A visual comparison of different fine-tuning approaches for table-tasks. (1) Dataset-specific fine-tuning: models are fine-tuned on the benchmark "training split" of one dataset, which performs well on the corresponding "test split" (but may not generalize to different datasets for the same task type). (2) Table-Specialist fine-tuning (this work): we propose to fine-tune one model per table-task (e.g., data cleaning, data transformation, etc.), which generalizes well across datasets for the same task type. (3) Table-Generalist fine-tuning: methods that fine-tune one general-purpose model to handle many different table-tasks, which has good generalizability, at the cost of lower performance on individual tasks.
  • Figure 2: "Dataset-specific fine-tuning" using GPT-3.5 for table-task $T$: (a) Schema-matching, (b) NL-to-SQL. In both cases, while GPT-3.5 fine-tuned using the training-split of one dataset $D$ leads to performance gains on the test-split of the same $D$ (shown as green arrows pointing up), it also results in significant performance loss on another dataset $D'$ for the same task type $T$ (red arrows down), relative to un-tuned vanilla GPT-3.5, suggesting likely over-fitting on $D$.
  • Figure 3: "Table-Specialist fine-tuning": Quality vs. latency comparison on two table-tasks: (a) NL-to-R; (b) NL-to-Scala. In both cases, Table-Specialist-GPT-3.5 significantly outperforms vanilla GPT-3.5, and can even outperform vanilla GPT-4 (shown on y-axis), making it possible to deploy Table-Specialist-GPT-3.5 over vanilla GPT-4 for these tasks, at substantially lower latency and costs (x-axis).
  • Figure 4: Example table-tasks: Error-detection and NL-to-SQL
  • Figure 5: Architecture of Table-Specialist using "Generator-Validator" fine-tuning for a given task type $T$ (Error-detection in this example). (1) A real table $R$ is sampled from a corpus of diverse tables; (2) Table $R$ is used to instantiate an instance of the generative table task $T_G(R)$ (left box); (3) A "Generator model" $M_G$ (initially a vanilla language-model) is used to generate a completion for $T_G(R)$, in this case a possible typo error "Missisipi"; (4) The completion "Missisipi" is inserted into $R$, and used to instantiate a classification version of the Error-detection task $T_C$ (right box), which is validated by a "Validator model" $M_C$ for the classification task (initially also a vanilla language-model). If $M_C$ consistently produces "Missisipi" for $T_C$, then "Missisipi" is considered validated (i.e., likely a real error); (5-6) Validated training data is then used to re-train the Generator $M_G$ and Validator $M_C$, yielding more effective Generator and Validator models. We iteratively fine-tune $M_G$ and $M_C$ by repeating steps (1)-(6).
  • ...and 9 more figures
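
Steps (1)-(4) of the Figure 5 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `generate_validated_data`, the `n_votes` parameter, the vocabulary, and the two toy models are all hypothetical stand-ins for $M_G$ and $M_C$, chosen so the example runs without any language model. Steps (5)-(6), the actual fine-tuning, are only indicated in a comment.

```python
from typing import Callable, List, Tuple

Table = List[str]  # simplification: a table is a single column of cell values


def generate_validated_data(
    tables: List[Table],
    generator: Callable[[Table], Tuple[int, str]],
    validator: Callable[[Table], str],
    n_votes: int = 3,
) -> List[Tuple[Table, str]]:
    """One round of generate-then-validate over a list of sampled tables."""
    validated = []
    for table in tables:                       # step 1: a sampled real table R
        row, typo = generator(table)           # steps 2-3: M_G proposes an error
        corrupted = list(table)
        corrupted[row] = typo                  # step 4: insert it into a copy of R
        votes = [validator(corrupted) for _ in range(n_votes)]
        if all(v == typo for v in votes):      # keep only consistently detected errors
            validated.append((corrupted, typo))
    # Steps 5-6 (not shown): fine-tune M_G and M_C on `validated`, then repeat.
    return validated


# Hypothetical toy models, standing in for language-model calls.
VOCAB = {"Mississippi", "Alabama", "Georgia", "Florida"}


def toy_generator(table: Table) -> Tuple[int, str]:
    """Drop the 4th character of the first cell longer than 3 characters."""
    for i, cell in enumerate(table):
        if len(cell) > 3:
            return i, cell[:3] + cell[4:]
    return 0, table[0]


def toy_validator(table: Table) -> str:
    """Report the first cell that is not in the known vocabulary."""
    for cell in table:
        if cell not in VOCAB:
            return cell
    return ""


tables = [["Mississippi", "Alabama"], ["Georgia", "Florida"], ["Ga", "Florida"]]
data = generate_validated_data(tables, toy_generator, toy_validator)
for corrupted, typo in data:
    print(corrupted, "->", typo)
```

In the third table the validator flags the pre-existing cell "Ga" rather than the injected typo, so that example is discarded. This mirrors the idea that only generations the Validator agrees with are kept as training data.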

Theorems & Definitions (10)

  • Definition 1
  • Example 1
  • Definition 2
  • Example 2
  • Proposition 1
  • Example 3
  • Example 4
  • Proposition 2
  • Example 5
  • Example 6