Crossing Domains without Labels: Distant Supervision for Term Extraction

Elena Senger, Yuri Campbell, Rob van der Goot, Barbara Plank

TL;DR

DiSTER addresses cross-domain Automatic Term Extraction by distilling knowledge from a black-box LLM into smaller open models trained on synthetically generated pseudo-labels, followed by post-hoc consistency adjustments. The method is evaluated on a seven-domain benchmark spanning biomedicine, corruption, dressage, heart failure, coastal geography, computational linguistics, and wind energy, using both corpus- and document-level metrics. Results show the DiSTER models outperform state-of-the-art sequence-labeling and few-shot baselines on most domains and approach the GPT-4o teacher's performance, with document-level consistency boosting F1 by up to 55 points. The work provides a scalable, annotation-free path to robust cross-domain ATE and releases the SynTerm dataset and fine-tuned models to support future research.

Abstract

Automatic Term Extraction (ATE) is a critical component in downstream NLP tasks such as document tagging, ontology construction and patent analysis. Current state-of-the-art methods require expensive human annotation and struggle with domain transfer, limiting their practical deployment. This highlights the need for more robust, scalable solutions and realistic evaluation settings. To address this, we introduce a comprehensive benchmark spanning seven diverse domains, enabling performance evaluation at both the document and corpus levels. Furthermore, we propose a robust LLM-based model that outperforms both supervised cross-domain encoder models and few-shot learning baselines and performs competitively with its GPT-4o teacher on this benchmark. The first step of our approach is generating pseudo-labels with this black-box LLM on general and scientific domains to ensure generalizability. Building on this data, we fine-tune the first LLMs for ATE. To further enhance document-level consistency, which downstream tasks often require, we introduce lightweight post-hoc heuristics. Our approach outperforms previous approaches on 5/7 domains with an average improvement of 10 percentage points. We release our dataset and fine-tuned models to support future research in this area.

Paper Structure

This paper contains 33 sections, 2 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: An overview of the key components of our DiSTER approach.
  • Figure 2: Conversation example showing extraction of domain-specific terms from an arXiv text. For data points without a specific domain, such as those from the Pile, we substitute "General" for the domain.
  • Figure 3: Counts of unique terms among the considered datasets. Off-diagonal counts represent common terms.
  • Figure 4: Directional k-NN domain overlap score.
  • Figure 5: Directional k-NN overlap score for corpus-level terms across domains.