Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data

Julian Schelb, Roberto Ulloa, Andreas Spitz

TL;DR

This study tackles scalable topic classification of German policy content in web data by comparing fine-tuned encoder models to in-context learning with limited labels. It uses three German policy topics and analyzes both URL-only and URL+content features, finding that fine-tuning yields superior performance, while content signals consistently boost accuracy. GELECTRA-Large with URL+content achieves the strongest overall results, though model size and data quality critically influence performance on noisier, real-world data; zero- and few-shot prompting can approach this level but generally lags behind supervised training. The work provides practical guidance for researchers analyzing information exposure in large-scale web data and highlights the importance of data quality, sampling strategies, and robust evaluation under noisy conditions for political and social science applications.

Abstract

Researchers in the political and social sciences often rely on classification models to analyze trends in information consumption by examining browsing histories of millions of webpages. Automated, scalable methods are necessary because manual labeling is impractical at this scale. In this paper, we model the detection of topic-related content as a binary classification task and compare the accuracy of fine-tuned pre-trained encoder models against in-context learning strategies. Using only a few hundred annotated data points per topic, we detect content related to three German policies in a database of scraped webpages. We compare multilingual and monolingual models, as well as zero- and few-shot approaches, and investigate the impact of negative sampling strategies and the combination of URL- and content-based features. Our results show that a small sample of annotated data is sufficient to train an effective classifier. Fine-tuning encoder-based models yields better results than in-context learning. Classifiers using both URL- and content-based features perform best, while using URLs alone provides adequate results when content is unavailable.

Paper Structure (42 sections, 3 figures, 9 tables)

Figures (3)

  • Figure 1: Webpage processing and classification pipeline. The extracted webpage content is divided into chunks, maintaining the original labels. Chunk level predictions are aggregated to obtain the final label per URL.
  • Figure 2: Prompt template for zero- and few-shot classification. General task instruction and the incomplete example are consistent across all experiments. For few-shot experiments, $k$ additional demonstrators are included (see the policy descriptions appendix for details).
  • Figure 3: Precision-recall curves for GELECTRA-Large across topics on the Complete test set. Cannabis shows the highest precision-recall performance and Energy the lowest (recall that the number of webpages varies between the topics).
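
The pipeline described in the Figure 1 caption (split the page content into chunks, classify each chunk, aggregate chunk-level predictions into one label per URL) can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the word-based chunk size, the max-score aggregation rule, the 0.5 threshold, and the names `chunk_text`, `label_chunks`, `classify_page`, and `toy_scorer` are all assumptions introduced here, since the source does not state how chunks are sized or how predictions are combined.

```python
from typing import Callable, List, Tuple


def chunk_text(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words.

    Word-based chunking is an assumption; the paper may chunk by tokens.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def label_chunks(text: str, label: int, max_words: int = 200) -> List[Tuple[str, int]]:
    """Training side: every chunk inherits the label of its source page."""
    return [(chunk, label) for chunk in chunk_text(text, max_words)]


def classify_page(
    text: str,
    score_chunk: Callable[[str], float],
    max_words: int = 200,
    threshold: float = 0.5,
) -> bool:
    """Inference side: score each chunk, then aggregate to one label per URL.

    Aggregation here is "positive if any chunk scores at or above the
    threshold" (max pooling). This rule is an assumption, not the paper's.
    """
    chunks = chunk_text(text, max_words)
    if not chunks:
        return False  # no extractable content, so no positive evidence
    return max(score_chunk(chunk) for chunk in chunks) >= threshold


def toy_scorer(chunk: str) -> float:
    """Hypothetical stand-in for a fine-tuned encoder's positive-class probability."""
    return 1.0 if "cannabis" in chunk.lower() else 0.0


if __name__ == "__main__":
    page = "Der Bundestag debattiert heute die Legalisierung von Cannabis in Deutschland"
    print(label_chunks(page, label=1, max_words=4))
    print(classify_page(page, toy_scorer, max_words=4))
```

In practice `score_chunk` would wrap a fine-tuned classifier, and other aggregation rules (mean score, majority vote) are equally plausible; which one works best would need to be checked against held-out data.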