ZTab: Domain-based Zero-shot Annotation for Table Columns

Ehsan Hoseinzade, Ke Wang

Abstract

This study addresses the challenge of automatically detecting semantic column types in relational tables, a key task in many real-world applications. Zero-shot modeling eliminates the need for user-provided labeled training data, making it ideal for scenarios where data collection is costly or restricted due to privacy concerns. However, existing zero-shot models suffer from poor performance when the number of semantic column types is large, limited understanding of tabular structure, and privacy risks arising from dependence on high-performance closed-source LLMs. We introduce ZTab, a domain-based zero-shot framework that addresses both performance and zero-shot requirements. Given a domain configuration consisting of a set of predefined semantic types and sample table schemas, ZTab generates pseudo-tables for the sample schemas and fine-tunes an annotation LLM on them. ZTab is domain-based zero-shot in that it does not depend on user-specific labeled training data; therefore, no retraining is needed for a test table from a similar domain. We describe three cases of domain-based zero-shot. The domain configuration of ZTab provides a trade-off between the extent of zero-shot and annotation performance: a "universal domain" that contains all semantic types approaches "pure" zero-shot, while a "specialized domain" that contains semantic types for a specific application enables better zero-shot performance within that domain. Source code and datasets are available at https://github.com/hoseinzadeehsan/ZTab
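
The pseudo-table step described above can be illustrated with a minimal sketch. It assumes that each semantic type has a "class prototype" that is simply a list of example cell values, and that a pseudo-table is built by sampling from the prototype of each column in a sample schema. The function name `generate_pseudo_table`, the `prototypes` dictionary, and the sample values are hypothetical and are not taken from the ZTab code base; how ZTab actually produces prototypes and how it fine-tunes the annotation LLM are not shown here.

```python
import random


def generate_pseudo_table(schema, prototypes, num_rows, seed=0):
    """Build one pseudo-table for a sample schema (illustrative only).

    schema:     list of semantic types, one per column
    prototypes: dict mapping a semantic type to example cell values
                (assumed representation of a class prototype)
    Returns (rows, labels), where labels[i] is the type of column i.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    rows = []
    for _ in range(num_rows):
        # Sample one cell per column from that column's prototype values.
        rows.append([rng.choice(prototypes[sem_type]) for sem_type in schema])
    return rows, list(schema)


# Hypothetical domain configuration: two semantic types, one sample schema.
prototypes = {
    "city": ["Vancouver", "Lyon", "Osaka"],
    "country": ["Canada", "France", "Japan"],
}
schema = ["city", "country"]

rows, labels = generate_pseudo_table(schema, prototypes, num_rows=4)
```

Each `(rows, labels)` pair could then serve as a labeled training example for the annotation model, since the column types are known by construction and no user-provided labels are involved.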

Paper Structure (23 sections, 4 figures, 11 tables, 2 algorithms)

Figures (4)

  • Figure 1: Comparison between (a) pure zero-shot column type annotation and (b) domain-based zero-shot ZTab. ZTab takes a class list and table schema collection as inputs, generates class prototypes (1) and pseudo-tables based on class prototypes (2), and fine-tunes an annotation LLM (3).
  • Figure 2: Prompt for target column $t_i$ in a table with $n$ columns and $k$ rows.
  • Figure 3: Top 50 classes in the WikiTable dataset, ranked by the improvement of ZTab-performance (GPT-4.1-mini, GPT-4.1-mini) over the baseline CENTS (GPT-4.1-mini).
  • Figure 4: Effect of fine-tuning epochs on micro F1 score (avg. over datasets).
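
For readers unfamiliar with the metric named in Figure 4, the following is a minimal sketch of micro F1 for single-label column type annotation. It pools true positives, false positives, and false negatives over all classes before computing precision and recall. The function name `micro_f1` and the example labels are illustrative assumptions, not the paper's evaluation code.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over single-label predictions (illustrative)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            fp[pred_label] += 1  # predicted class gains a false positive
            fn[true_label] += 1  # true class gains a false negative

    # Pool the counts across all classes (the "micro" part).
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    if total_tp == 0:
        return 0.0

    precision = total_tp / (total_tp + total_fp)
    recall = total_tp / (total_tp + total_fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted column types for four columns.
gold = ["city", "country", "person", "city"]
pred = ["city", "city", "person", "city"]
score = micro_f1(gold, pred)  # 3 of 4 columns correct
```

When every column receives exactly one predicted type, each error counts once as a false positive and once as a false negative, so micro F1 coincides with plain accuracy.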