LLM-assisted Labeling Function Generation for Semantic Type Detection
Chenjie Li, Dan Zhang, Jin Wang
TL;DR
The paper tackles semantic type detection for columns in data lake tables, a task hampered by the scarcity of high-quality labeled data. It proposes an end-to-end weak supervision pipeline in which Large Language Models (LLMs) automatically generate labeling functions (LFs) from carefully designed prompts, and a stacked Snorkel-based label model handles the large label space. Key contributions include ground-truth-aware LF prompts, a scalable label-space partitioning strategy, and an empirical evaluation on VizNet and WikiTables showing that the end model can be improved with limited labeled data, along with insights into LF design and scalability. The gains for the end model are meaningful, but a gap to fully supervised baselines remains, which points to richer LF types and more scalable label models as directions for handling complex semantic-type spaces. Overall, this work offers a practical, annotation-efficient pathway for semantic type detection and informs future research on LF generation and scalable weak supervision in table understanding.
Abstract
Detecting the semantic types of columns in data lake tables is an important application. A key bottleneck in semantic type detection is the limited availability of human annotations, which stems from the inherent complexity of data lakes. In this paper, we propose using programmatic weak supervision to assist in annotating training data for semantic type detection by leveraging labeling functions. One challenge in this process is that labeling functions are difficult to write manually, given the large volume and low quality of data lake table datasets. To address this issue, we explore employing Large Language Models (LLMs) for labeling function generation and introduce several prompt engineering strategies for this purpose. We conduct experiments on real-world web table datasets. Based on the initial results, we perform extensive analysis and provide empirical insights and future directions for researchers in this field.
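To make the notion of a labeling function concrete, the following is a minimal sketch of what such functions could look like for column type detection. It is not the authors' implementation: the three semantic types, the 0.8 match threshold, the tiny country list, and all function names are hypothetical, and a plain majority vote stands in for the Snorkel-based label model the paper describes.

```python
import re
from collections import Counter

ABSTAIN = -1
# Hypothetical label space, chosen only for illustration.
LABELS = {"email": 0, "year": 1, "country": 2}
THRESHOLD = 0.8  # assumed fraction of cells that must match for an LF to vote

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
YEAR_RE = re.compile(r"(1[5-9]|20)\d{2}")
KNOWN_COUNTRIES = {"france", "japan", "brazil", "canada", "kenya"}


def _match_rate(values, predicate):
    """Fraction of cell values for which the predicate holds."""
    if not values:
        return 0.0
    return sum(1 for v in values if predicate(v.strip())) / len(values)


def lf_email(values):
    """Vote 'email' when most cells look like e-mail addresses."""
    rate = _match_rate(values, lambda v: EMAIL_RE.fullmatch(v) is not None)
    return LABELS["email"] if rate >= THRESHOLD else ABSTAIN


def lf_year(values):
    """Vote 'year' when most cells are four-digit years from 1500 to 2099."""
    rate = _match_rate(values, lambda v: YEAR_RE.fullmatch(v) is not None)
    return LABELS["year"] if rate >= THRESHOLD else ABSTAIN


def lf_country(values):
    """Vote 'country' when most cells appear in a small gazetteer."""
    rate = _match_rate(values, lambda v: v.lower() in KNOWN_COUNTRIES)
    return LABELS["country"] if rate >= THRESHOLD else ABSTAIN


LABELING_FUNCTIONS = [lf_email, lf_year, lf_country]


def weak_label(values):
    """Combine LF votes by majority; abstain if no LF fires.

    This is a simple stand-in for a learned label model.
    """
    votes = [lf(values) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]
```

In this sketch, each function inspects the cell values of one column and either votes for a type or abstains. A column for which every function abstains receives no weak label and would be left out of the training set for the end model.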
