ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

Benjamin Feuer, Yurong Liu, Chinmay Hegde, Juliana Freire

TL;DR

ArcheType introduces a practical, open‑ended framework for semantic column type annotation using large language models. By decomposing CTA into context sampling, prompt serialization, model querying, and label remapping, it achieves strong zero‑shot performance across diverse benchmarks and remains competitive in fine‑tuned settings with far less labeled data. The paper also introduces three new zero‑shot benchmarks (D4Tables, AmstrTables, PubchemTables) to probe domain specificity and distribution shift, and demonstrates that open‑source LLMs can approach or match closed‑source models on many tasks. The work highlights the importance of modular design and hyperparameterization (especially context sampling and label remapping) for robust, scalable CTA in real‑world data pipelines, and provides open‑source code to spur further research and deployment.

Abstract

Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type and incur large run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes a new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark. Our code is available at https://github.com/penfever/ArcheType.
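The four stages named in the abstract can be sketched as a small pipeline. This is only an illustrative sketch, not the authors' implementation: the function names (`sample_context`, `serialize_prompt`, `remap_label`, `annotate_column`), the prompt wording, the distinct-value sampler, and the fall-back-to-a-default remapping rule are all assumptions made for this example, and the model is represented by any caller-supplied function from prompt string to output string.

```python
import random
from typing import Callable, List


def sample_context(column: List[str], k: int, seed: int = 0) -> List[str]:
    """Stage 1 (assumed placeholder): pick up to k distinct, non-empty values.

    The paper's own sampling algorithm is not reproduced here.
    """
    unique = sorted({v for v in column if v.strip()})
    if len(unique) <= k:
        return unique
    return random.Random(seed).sample(unique, k)


def serialize_prompt(context: List[str], labels: List[str]) -> str:
    """Stage 2: turn the label set and sampled values into one prompt string.

    The wording below is hypothetical, not one of the paper's prompts.
    """
    return (
        "Pick the one category that best describes the column.\n"
        f"CATEGORIES: {', '.join(labels)}\n"
        f"COLUMN VALUES: {', '.join(context)}\n"
        "CATEGORY:"
    )


def remap_label(raw: str, labels: List[str], default: str) -> str:
    """Stage 4 (assumed rule): accept a case-insensitive exact match,
    otherwise fall back to a caller-chosen default label."""
    cleaned = raw.strip().lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    return default


def annotate_column(
    column: List[str],
    labels: List[str],
    query_model: Callable[[str], str],
    default: str,
    k: int = 5,
) -> str:
    """Run the four stages in order and return one label from `labels`."""
    context = sample_context(column, k)          # 1. context sampling
    prompt = serialize_prompt(context, labels)   # 2. prompt serialization
    raw = query_model(prompt)                    # 3. model querying
    return remap_label(raw, labels, default)     # 4. label remapping


if __name__ == "__main__":
    # A stub stands in for the LLM so the sketch runs without any model.
    stub_model = lambda prompt: " Country\n"
    print(annotate_column(["France", "Peru", "France"], ["country", "date"],
                          stub_model, default="date"))
```

Keeping each stage as a separate function mirrors the modular decomposition the abstract describes, so a sampler or remapping rule can be swapped without touching the other stages.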

Paper Structure (30 sections, 4 equations, 9 figures, 11 tables, 4 algorithms)

This paper contains 30 sections, 4 equations, 9 figures, 11 tables, 4 algorithms.

Figures (9)

  • Figure 1: ArcheType: a four-stage method for column type annotation. (1) In the Context Sampling stage, an algorithm selects a few representative samples from a column. (2) In the Prompt Serialization stage, the context and instruction string are serialized in a model-specific, token-efficient manner. (3) The prompt is input to an LLM in the Model Querying stage. (4) If the output of the LLM is not one of the allowable categories, the Label Remapping stage assigns the model output to a class.
  • Figure 2: Examples of ArcheType fine-tuned (top) and zero-shot (bottom) prompting.
  • Figure 3: Six prompt variations. In zero-shot ArcheType, we treat prompting as a hyperparameter, and sweep over six distinct prompts, each chosen according to a conceptual serialization strategy. <CLASSNAMES> stands in for the label set, <CONTEXT> for the output of the context sampling step. We use two variants of the "B" prompt, with semantic differences denoted by "|".
  • Figure 4: ArcheType sampling outperforms baseline methods. The sampling method used by zero-shot ArcheType with different architectures (GPT, UL2, and T5) on the SOTAB-27 dataset substantially outperforms simple random sampling (SRS) and first-k-entries sampling (FS), as used in prior work [kayali2023chorus; korini2023column].
  • Figure 5: ArcheType performance is affected by context size and label remapping. The model benefits from increasing the context size from 3 to 10 samples. All methods outperform a baseline no-op method. CONTAINS+RESAMPLE performs best at every context scale.
  • ...and 4 more figures
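The captions for Figures 4 and 5 mention two baseline samplers (SRS and first-k entries) and a CONTAINS+RESAMPLE remapping strategy without spelling them out. The sketch below is one plausible reading, offered as an assumption rather than the paper's definition: "CONTAINS" is taken to mean that exactly one allowed label appears as a substring of the model output, and "RESAMPLE" is taken to mean drawing a fresh context and querying again when that check fails. All function names and the `max_tries` parameter are hypothetical.

```python
import random
from typing import Callable, List, Optional


def first_k_sample(column: List[str], k: int) -> List[str]:
    """Baseline FS: take the first k entries of the column."""
    return column[:k]


def simple_random_sample(column: List[str], k: int,
                         rng: random.Random) -> List[str]:
    """Baseline SRS: draw up to k entries uniformly without replacement."""
    return rng.sample(column, min(k, len(column)))


def contains_remap(raw: str, labels: List[str]) -> Optional[str]:
    """Assumed CONTAINS rule: return the label if exactly one allowed
    label occurs in the output (case-insensitive), else None."""
    text = raw.lower()
    hits = [label for label in labels if label.lower() in text]
    return hits[0] if len(hits) == 1 else None


def contains_resample(
    column: List[str],
    labels: List[str],
    query: Callable[[List[str]], str],
    k: int,
    max_tries: int = 3,
    seed: int = 0,
) -> Optional[str]:
    """Assumed CONTAINS+RESAMPLE loop: retry with a new random context
    until the output maps to a single label or tries run out."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        context = simple_random_sample(column, k, rng)
        label = contains_remap(query(context), labels)
        if label is not None:
            return label
    return None  # caller decides what to do with an unmapped column


if __name__ == "__main__":
    replies = iter(["not sure", "This looks like a URL column."])
    result = contains_resample(
        ["a.com", "b.org", "c.net", "d.io"], ["url", "date"],
        query=lambda context: next(replies), k=3,
    )
    print(result)
```

Here `query` hides prompt serialization and the model call behind a single function of the sampled context, so the retry logic can be exercised with a stub as in the usage lines.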

Theorems & Definitions (8)

  • Example
  • Example
  • Example
  • Example
  • Definition 1: Fine-tuned LLM-CTA
  • Definition 2: Zero-shot LLM-CTA
  • Example
  • Example