Controlling Out-of-Domain Gaps in LLMs for Genre Classification and Generated Text Detection

Dmitri Roussinov, Serge Sharoff, Nadezhda Puchnina

TL;DR

This paper investigates the out-of-domain performance gap of large language models under in-context learning for non-topical tasks, focusing on genre classification and generated-text detection. It proposes a prompt-control method that specifies which predictive indicators to use or ignore, emphasizing stylistic features over topical content. Across GPT-4o and Claude models, this approach reduces the OOD gap by up to 20 percentage points in few-shot settings and outperforms baseline Chain-of-Thought prompts, with ablations confirming the value of explicit feature guidance. The work highlights practical domain-transfer improvements and points to future work in multilingual settings and ethical considerations around bias and misuse of prompt-control techniques.

Abstract

This study demonstrates that the modern generation of Large Language Models (LLMs, such as GPT-4) suffers from the same out-of-domain (OOD) performance gap observed in prior research on pre-trained Language Models (PLMs, such as BERT). We demonstrate this across two non-topical classification tasks: 1) genre classification and 2) generated text detection. Our results show that when demonstration examples for In-Context Learning (ICL) come from one domain (e.g., travel) and the system is tested on another domain (e.g., history), classification performance declines significantly. To address this, we introduce a method that controls which predictive indicators are used and which are excluded during classification. For the two tasks studied here, this ensures that topical features are omitted, while the model is guided to focus on stylistic rather than content-based attributes. This approach reduces the OOD gap by up to 20 percentage points in a few-shot setup. Straightforward Chain-of-Thought (CoT) methods, used as the baseline, prove insufficient, while our approach consistently enhances domain transfer performance.
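To make the "OOD gap in percentage points" concrete, here is a minimal sketch. It assumes the gap is measured as in-domain accuracy minus out-of-domain accuracy on the same test set; the function names and all labels below are hypothetical, invented for illustration, and are not results from the paper.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def ood_gap_points(in_domain_acc, out_of_domain_acc):
    """Assumed definition: in-domain minus out-of-domain accuracy, in percentage points."""
    return 100.0 * (in_domain_acc - out_of_domain_acc)


# Made-up labels for illustration only (not data from the paper).
gold = ["news", "review", "news", "review", "news"]
# Predictions when the ICL demonstrations share the test topic.
preds_in_domain = ["news", "review", "news", "review", "news"]
# Predictions when the ICL demonstrations come from a different topic.
preds_out_of_domain = ["news", "news", "news", "review", "review"]

gap = ood_gap_points(
    accuracy(preds_in_domain, gold),
    accuracy(preds_out_of_domain, gold),
)
print(f"OOD gap: {gap:.1f} percentage points")
```

Under this reading, a method "reduces the OOD gap" when it shrinks this difference, ideally by raising out-of-domain accuracy rather than lowering in-domain accuracy.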

Paper Structure (24 sections, 2 figures, 10 tables)

This paper contains 24 sections, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Domain transfer assessment methodology adapted from roussinov23emnlp for few-shot In-Context Learning (ICL), independently testing two tasks: (1) genre classification and (2) generated text detection. Prompt construction may optionally include instructions on which indicators to use or ignore. The topic modeling scores determine which texts are considered on-topic (top) or off-topic (bottom). On-topic texts are used for testing and, depending on the configuration, for ICL demonstration examples, while off-topic texts are used only as demonstration examples. Synthetic texts are generated by an LLM for the generated text detection task. This methodology is applicable to other non-topical classification tasks, such as gender determination, authorship identification, and sentiment analysis.
  • Figure 2: Accuracy comparison between GPT-4o baseline and detailed control prompts across different numbers of demonstration examples (shots).
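
The pipeline described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Text` class, `split_by_topic`, `build_prompt`, the rank-based reading of "top" and "bottom", the prompt wording, and the example texts and labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Text:
    content: str
    label: str          # genre label, or "human" / "generated"
    topic_score: float  # hypothetical topic-model score for the test topic


def split_by_topic(texts, k):
    """Return (on_topic, off_topic): the k highest- and k lowest-scoring texts.

    Assumption: "top" and "bottom" in the caption refer to a ranking by score.
    """
    if k < 1 or 2 * k > len(texts):
        raise ValueError("k must be >= 1 and at most half the number of texts")
    ranked = sorted(texts, key=lambda t: t.topic_score, reverse=True)
    return ranked[:k], ranked[-k:]


def build_prompt(demos, test_text, labels, use=None, ignore=None):
    """Assemble a few-shot prompt with optional indicator-control instructions."""
    parts = [f"Classify the text into one of: {', '.join(labels)}."]
    if use:
        parts.append("Base your decision on: " + "; ".join(use) + ".")
    if ignore:
        parts.append("Do not rely on: " + "; ".join(ignore) + ".")
    for demo in demos:
        parts.append(f"Text: {demo.content}\nLabel: {demo.label}")
    parts.append(f"Text: {test_text}\nLabel:")
    return "\n\n".join(parts)


# Invented example texts and labels, for illustration only.
corpus = [
    Text("Book your ferry early; cabins sell out by June.", "advice", 0.91),
    Text("The hotel was clean but the breakfast was poor.", "review", 0.84),
    Text("The treaty of 1648 ended the war.", "reference", 0.07),
    Text("Do not cite the chronicle without checking its date.", "advice", 0.12),
]
on_topic, off_topic = split_by_topic(corpus, k=2)

# Out-of-domain configuration: off-topic demonstrations, on-topic test text.
prompt = build_prompt(
    demos=off_topic,
    test_text=on_topic[0].content,
    labels=["advice", "review", "reference"],
    use=["sentence structure", "tone", "register"],
    ignore=["the subject matter of the text"],
)
print(prompt)
```

Passing `demos=on_topic[1:]` instead would give an in-domain configuration, and omitting `use` and `ignore` would give a prompt without control instructions, so the same helper covers the configurations the caption mentions.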