Detecting Statements in Text: A Domain-Agnostic Few-Shot Solution

Sandrine Chausson, Björn Ross

TL;DR

The paper introduces a domain-agnostic, few-shot approach for detecting statements by modeling classes as taxonomies of claims and leveraging NLI scores with Probabilistic Bisection to adapt per-claim thresholds with minimal annotation. It replaces traditional fine-tuning with a data-efficient pipeline that uses threshold-tuning on a small annotated sample, enabling theory-grounded, interpretable classifications across diverse tasks. Evaluations on climate contrarian detection, topic/stance classification, and depressive symptom detection show competitive performance, often with substantially fewer annotations than strong baselines. The method emphasizes transparency and transferability, and the authors provide a public GitHub resource to facilitate adoption and extension in Computational Social Science research.

Abstract

Many tasks related to Computational Social Science and Web Content Analysis involve classifying pieces of text based on the claims they contain. State-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce. In light of this, we propose and release a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task. This methodology involves defining the classes as arbitrarily sophisticated taxonomies of claims, and using Natural Language Inference models to obtain the textual entailment between these and a corpus of interest. The performance of these models is then boosted by annotating a minimal sample of data points, dynamically sampled using the well-established statistical heuristic of Probabilistic Bisection. We illustrate this methodology in the context of three tasks: climate change contrarianism detection, topic/stance classification, and depression-related symptom detection. This approach rivals traditional pre-train/fine-tune approaches while drastically reducing the need for data annotation.
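
As a rough illustration of the classification step, the sketch below assumes that NLI entailment scores between a text and each claim have already been computed, and that a class is detected when at least one of its claims scores at or above that claim's tuned threshold. The aggregation rule, the function name `detect_classes`, the class label, the second claim, and all numeric values are assumptions made for this example; only the claim "Greenland is gaining ice" (1.1.2.0) comes from the paper.

```python
from typing import Dict, List

# Hypothetical taxonomy: class label -> {claim id: claim text}.
# Only claim 1.1.2.0 appears in the source; everything else is illustrative.
TAXONOMY: Dict[str, Dict[str, str]] = {
    "ice_is_not_melting": {
        "1.1.2.0": "Greenland is gaining ice",
        "hypothetical-claim": "Antarctica is gaining ice",
    }
}


def detect_classes(
    entailment: Dict[str, float],
    thresholds: Dict[str, float],
    taxonomy: Dict[str, Dict[str, str]],
) -> List[str]:
    """Return the classes for which at least one claim clears its threshold.

    `entailment` maps claim id -> NLI entailment score for one text.
    `thresholds` maps claim id -> per-claim decision threshold.
    The "any claim" rule is an assumption, not the paper's stated rule.
    """
    detected = []
    for label, claims in taxonomy.items():
        if any(entailment[cid] >= thresholds[cid] for cid in claims):
            detected.append(label)
    return detected


# Placeholder scores and thresholds (made up for the example, not results).
example_scores = {"1.1.2.0": 0.91, "hypothetical-claim": 0.12}
example_thresholds = {"1.1.2.0": 0.80, "hypothetical-claim": 0.75}
print(detect_classes(example_scores, example_thresholds, TAXONOMY))
```

Because each decision traces back to a named claim and its threshold, the output can be inspected claim by claim, which is the kind of interpretability the approach aims for.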

Paper Structure (22 sections, 1 equation, 3 figures, 10 tables)

Figures (3)

  • Figure 1: Overview of our proposed methodology.
  • Figure 2: Probability distribution over threshold location for the claim 1.1.2.0: "Greenland is gaining ice". The red line shows the median of the current distribution, which defines the next datapoint to be selected for annotation. Over successive iterations of the Probabilistic Bisection Algorithm, the probability mass concentrates around the optimal threshold for that claim. Note that the y-axis for the 16th annotation is different from other figures.
  • Figure 3: Average distance from the true optimal threshold, as calculated from the entire training set, per round of annotation and associated standard error. The standard error is calculated from the standard deviation from the average distance across all claims, and the number of claims still being annotated at that given round of annotation.
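
The loop described in the Figure 2 caption can be sketched as a generic Probabilistic Bisection routine over a discretised score range. This is a minimal sketch under stated assumptions, not the authors' implementation: the grid over [0, 1], the assumed annotation reliability `p_correct = 0.8`, the function names, and the reading that a positive annotation at score s means the threshold lies at or below s are all choices made for this example.

```python
from typing import Callable, List


def median_of(grid: List[float], probs: List[float]) -> float:
    """Smallest grid point whose cumulative probability reaches 0.5."""
    cumulative = 0.0
    for x, p in zip(grid, probs):
        cumulative += p
        if cumulative >= 0.5:
            return x
    return grid[-1]


def pba_update(
    grid: List[float],
    probs: List[float],
    query: float,
    is_positive: bool,
    p_correct: float = 0.8,
) -> List[float]:
    """Reweight the distribution over threshold locations after one annotation.

    Assumption: a positive label at `query` suggests threshold <= query,
    a negative label suggests threshold > query.
    """
    updated = []
    for x, p in zip(grid, probs):
        agrees = (x <= query) if is_positive else (x > query)
        updated.append(p * (p_correct if agrees else 1.0 - p_correct))
    total = sum(updated)
    return [v / total for v in updated]


def run_pba(
    oracle: Callable[[float], bool],
    n_rounds: int = 16,
    grid_size: int = 101,
    p_correct: float = 0.8,
) -> float:
    """Query the median each round and return the final median estimate."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    probs = [1.0 / grid_size] * grid_size  # uniform prior over thresholds
    for _ in range(n_rounds):
        query = median_of(grid, probs)  # the "red line" in Figure 2
        probs = pba_update(grid, probs, query, oracle(query), p_correct)
    return median_of(grid, probs)


# Toy noiseless annotator with a made-up true threshold of 0.62.
estimate = run_pba(lambda score: score >= 0.62)
print(f"estimated threshold: {estimate:.2f}")
```

In practice the query would be the unannotated datapoint whose NLI score is closest to the median rather than the grid point itself, and the oracle would be a human annotator; both are simplified away here.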