LakeHopper: Cross Data Lakes Column Type Annotation through Model Adaptation

Yushi Sun, Xujia Li, Nan Tang, Quanqing Xu, Chuanhui Yang, Lei Chen

TL;DR

LakeHopper addresses column type annotation (CTA) across data lakes by adapting a source annotator to a target data lake with minimal target annotations. It identifies the knowledge gap between source and target, selects informative weak samples via clustering, and uses an incremental, rehearsal-based fine-tuning strategy to gradually transfer capabilities while preserving shared knowledge. The approach uses large language models (LLMs) as domain-agnostic guides to calibrate annotators based on pre-trained language models (PLMs), achieving high cross-lake generalizability and strong domain-specific accuracy at low adaptation cost. Empirical results on two data-lake transfers show significant improvements over state-of-the-art CTA methods in both low- and high-resource settings, and LakeHopper delivers substantial efficiency gains compared to fine-tuned TableLlama. The work suggests a practical pathway for reusing CTA models across diverse data lakes without extensive ground-truth labeling or expensive model retraining.

Abstract

Column type annotation is vital for tasks like data cleaning, integration, and visualization. Recent solutions rely on resource-intensive language models fine-tuned on well-annotated columns from a particular set of tables, i.e., a source data lake. In this paper, we study whether we can adapt an existing pre-trained LM-based model to a new (i.e., target) data lake so as to minimize the annotations required on the new data lake. However, several challenges exist: bridging the source-target knowledge gap, selecting informative target data, and fine-tuning without losing shared knowledge. We propose LakeHopper, a framework that identifies and resolves the knowledge gap through LM interactions, employs a cluster-based data selection scheme for unannotated columns, and uses an incremental fine-tuning mechanism that gradually adapts the source model to the target data lake. Our experimental results validate the effectiveness of LakeHopper on two different data lake transfers under both low-resource and high-resource settings.
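
To make the idea of cluster-based selection of unannotated columns more concrete, the following is a minimal sketch, not the paper's actual procedure. This summary does not specify the clustering algorithm, the column representation, or how "weak" samples are scored, so everything here is an assumption: columns are represented by hypothetical embedding vectors, a plain k-means with farthest-point initialization stands in for the clustering step, and the hypothetical `confidences` list stands in for the source annotator's confidence on each target column. The function name `select_weak_samples` is likewise illustrative.

```python
import math


def farthest_point_init(points, k):
    """Deterministic initialization: start at points[0], then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iters=20):
    """Plain k-means (an assumed stand-in for the clustering step).
    Returns the cluster index assigned to each point."""
    centers = farthest_point_init(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def select_weak_samples(embeddings, confidences, k, per_cluster=1):
    """From each cluster, pick the columns on which the (hypothetical)
    source annotator is least confident. Returns sorted column indices."""
    assign = kmeans(embeddings, k)
    selected = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        members.sort(key=lambda i: confidences[i])
        selected.extend(members[:per_cluster])
    return sorted(selected)


# Toy data: six columns forming two well-separated groups in embedding space.
embeddings = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
confidences = [0.9, 0.4, 0.8, 0.95, 0.7, 0.3]

to_annotate = select_weak_samples(embeddings, confidences, k=2)
print(to_annotate)
```

Picking one low-confidence column per cluster spreads a small annotation budget across different regions of the target data lake instead of spending it on near-duplicate columns. The selected indices would then be sent for annotation and used in the incremental fine-tuning stage.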

Paper Structure (41 sections, 1 equation, 6 figures, 14 tables, 1 algorithm)

This paper contains 41 sections, 1 equation, 6 figures, 14 tables, 1 algorithm.

Figures (6)

  • Figure 1: (a) A cross data lakes CTA example. (b) Knowledge of fine-tuned models ($S$ and $T$ for source and target annotators) and generic models ($G$).
  • Figure 2: The System Architecture of LakeHopper.
  • Figure 3: An illustration of label set difference adjustment.
  • Figure 4: The parameter K's sensitivity of LakeHopper on two data lake transfers.
  • Figure 5: The LLM query verification template.
  • ...and 1 more figures

Theorems & Definitions (1)

  • Example 1