Can Humans Identify Domains?

Maria Barrett, Max Müller-Eberstein, Elisa Bassignana, Amalie Brogaard Pauli, Mike Zhang, Rob van der Goot

TL;DR

This work examines the elusive notion of textual domain by studying how well humans can identify genre and topic from text. It uses the TGeGUM dataset, a multi-layer extension of GUM with 9.1k sentences annotated for 11 genres and 10/100 topics by three annotators per instance. Through exploratory data analysis and multiple modeling approaches, the authors show that humans achieve substantial agreement on genre when given context, while topic identification remains more challenging, especially at finer granularity, and that no single discrete domain captures all variability. Automatic models (DeBERTa-Large) align more with gold genre labels than human majority on prose, yet distributional modeling improves some measures over majority votes, suggesting that modeling annotation distributions can better reflect human uncertainty. Overall, the study argues for treating domain as a probabilistic, continuous space rather than a fixed set of categories. It emphasizes the critical role of context in genre discrimination and the limited consensus on topic delineation, with implications for domain-aware NLP systems and evaluation frameworks.

Abstract

Textual domain is a crucial property within the Natural Language Processing (NLP) community due to its effects on downstream model performance. The concept itself is, however, loosely defined and, in practice, refers to any non-typological property, such as genre, topic, medium or style of a document. We investigate the core notion of domains via human proficiency in identifying related intrinsic textual properties, specifically the concepts of genre (communicative purpose) and topic (subject matter). We publish our annotations in *TGeGUM*: A collection of 9.1k sentences from the GUM dataset (Zeldes, 2017) with single sentence and larger context (i.e., prose) annotations for one of 11 genres (source type), and its topic/subtopic as per the Dewey Decimal library classification system (Dewey, 1979), consisting of 10/100 hierarchical topics of increased granularity. Each instance is annotated by three annotators, for a total of 32.7k annotations, allowing us to examine the level of human disagreement and the relative difficulty of each annotation task. With a Fleiss' kappa of at most 0.53 on the sentence level and 0.66 at the prose level, it is evident that despite the ubiquity of domains in NLP, there is little human consensus on how to define them. By training classifiers to perform the same task, we find that this uncertainty also extends to NLP models.
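
The agreement figures quoted in the abstract are Fleiss' kappa values. As a rough illustration of how such a score is computed for a fixed number of annotators per item, here is a minimal sketch of the standard formula. The function name, the toy labels, and the toy annotations are assumptions made for illustration; none of them come from TGeGUM or the authors' code.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that each have the same number of raters.

    ratings: list of per-item label lists, e.g. three labels per sentence.
    categories: list of all possible labels.
    Note: undefined (division by zero) if every rating falls in one category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of raters who assigned item i to category j
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Expected agreement from the overall category proportions
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: four sentences, three annotators each
toy_labels = ["news", "fiction"]
toy_ratings = [
    ["news", "news", "news"],
    ["news", "fiction", "news"],
    ["fiction", "fiction", "fiction"],
    ["fiction", "news", "fiction"],
]
toy_kappa = fleiss_kappa(toy_ratings, toy_labels)
```

On this toy input, two of the four items have full agreement and two have a 2-to-1 split, which gives a kappa of one third. Full agreement on every item would give 1.0.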

Paper Structure (37 sections, 1 equation, 16 figures, 5 tables)


Figures (16)

  • Figure 1: Graphical illustration of our triple-annotation setup with gold genre labels.
  • Figure 2: Frequency distributions of the labels in gold genre labels, annotations of genres, annotations of topic-10, and annotations of topic-100 (log scale) at the sentence level. For the human annotations, the number is divided by three in order to align with the (unique) gold label. The mapping of topic-10 and topic-100 labels can be found in the paper's section on labels. The tag "No" in the topic annotations refers to no-topic.
  • Figure 3: Confusion matrix with all annotated pairs of labels for Genre and Topic-10 (across all annotators) in our training data: The darker the color, the higher the number of annotations for that label pair. The diagonal can be seen as agreement, whereas off-diagonal is a proxy for disagreement.
  • Figure 4: Frequency of sentence lengths, measured by the number of characters, per gold genre.
  • Figure 5: The target value each model variant is trained to predict: 1) Majority vote. 2) PerLabelRegr(ession) on label distributions. 3) PerLabel-Class(ification), on score bins per label. 4) PerAnnotator, three different annotations.
  • ...and 11 more figures
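
The caption of Figure 5 contrasts a majority-vote target with targets built from the distribution of labels across annotators. The sketch below shows one way these two kinds of target could be derived from three annotations of a single sentence. It is an illustration under assumptions, not the authors' implementation: the function names, the example labels, and the choice to return `None` on a tie are all hypothetical.

```python
from collections import Counter


def majority_vote(annotations):
    """Return the most frequent label, or None when the top count is tied.

    Returning None on ties is an assumption of this sketch; the source
    does not say how ties are resolved.
    """
    counts = Counter(annotations)
    _, top_count = counts.most_common(1)[0]
    winners = [label for label, n in counts.items() if n == top_count]
    return winners[0] if len(winners) == 1 else None


def label_distribution(annotations, labels):
    """Return the fraction of annotators who chose each label, in label order."""
    counts = Counter(annotations)
    return [counts[label] / len(annotations) for label in labels]


# Hypothetical example: three annotators label one sentence
example_labels = ["fiction", "news", "vlog"]
example_annotations = ["news", "news", "fiction"]

vote = majority_vote(example_annotations)
distribution = label_distribution(example_annotations, example_labels)
```

The majority vote collapses the 2-to-1 split into a single label, while the distribution keeps the information that one annotator disagreed. That retained disagreement is the kind of signal a distribution-based target can pass on to a model.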