
Beyond One-Size-Fits-All: Personalized Harmful Content Detection with In-Context Learning

Rufan Zhang, Lin Zhang, Xianghang Mi

TL;DR

This work shows that in-context learning with foundation models can unify detection of toxic, spam, and negative content across binary, multi-class, and multi-label tasks without retraining. It demonstrates strong cross-task generalization, with strong performance on public benchmarks and a new in-the-wild Mastodon dataset, while offering lightweight personalization: users can block new categories, unblock existing ones, or extend detection to semantic variations with minimal supervision. The study also shows that augmenting prompts with label definitions or rationales improves robustness to noisy real-world data, and it emphasizes the need to evaluate on in-the-wild data to capture domain shifts. Overall, the approach presents a privacy-preserving, user-centric pathway for next-generation content safety systems, with code and data publicly released to foster reproducibility.

Abstract

The proliferation of harmful online content--e.g., toxicity, spam, and negative sentiment--demands robust and adaptable moderation systems. However, prevailing moderation systems are centralized and task-specific, offering limited transparency and neglecting diverse user preferences--an approach ill-suited for privacy-sensitive or decentralized environments. We propose a novel framework that leverages in-context learning (ICL) with foundation models to unify the detection of toxicity, spam, and negative sentiment across binary, multi-class, and multi-label settings. Crucially, our approach enables lightweight personalization, allowing users to easily block new categories, unblock existing ones, or extend detection to semantic variations through simple prompt-based interventions--all without model retraining. Extensive experiments on public benchmarks (TextDetox, UCI SMS, SST2) and a new, annotated Mastodon dataset reveal that: (i) foundation models achieve strong cross-task generalization, often matching or surpassing task-specific fine-tuned models; (ii) effective personalization is achievable with as few as one user-provided example or definition; and (iii) augmenting prompts with label definitions or rationales significantly enhances robustness to noisy, real-world data. Our work demonstrates a definitive shift beyond one-size-fits-all moderation, establishing ICL as a practical, privacy-preserving, and highly adaptable pathway for the next generation of user-centric content safety systems. To foster reproducibility and facilitate future research, we publicly release our code on GitHub and the annotated Mastodon dataset on Hugging Face.
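
The prompt-based personalization described above (blocking a new category, unblocking an existing one, and supplying a definition or a single example) can be illustrated with a minimal sketch. This is not the paper's prompt template or released code: the `ModerationProfile` class, the `build_prompt` function, the label names, and the prompt wording are all hypothetical, chosen only to show how a per-user prompt could be assembled without retraining a model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ModerationProfile:
    """Hypothetical per-user state: label definitions plus labeled examples."""

    definitions: Dict[str, str] = field(default_factory=dict)
    examples: List[Tuple[str, List[str]]] = field(default_factory=list)

    def block(self, label: str, definition: str, example: Optional[str] = None) -> None:
        """Start detecting `label`, given a definition and optionally one example."""
        self.definitions[label] = definition
        if example is not None:
            self.examples.append((example, [label]))

    def unblock(self, label: str) -> None:
        """Stop detecting `label` and strip it from the stored demonstrations."""
        self.definitions.pop(label, None)
        self.examples = [
            (text, [l for l in labels if l != label])
            for text, labels in self.examples
        ]


def build_prompt(profile: ModerationProfile, query: str) -> str:
    """Assemble a multi-label ICL prompt (assumed wording, not the paper's)."""
    lines = [
        "Classify the message. Answer with every label that applies, or 'none'.",
        "",
        "Label definitions:",
    ]
    for label, definition in profile.definitions.items():
        lines.append(f"- {label}: {definition}")
    lines.append("")
    # Few-shot demonstrations; an example left with no labels is shown as 'none'.
    for text, labels in profile.examples:
        lines.append(f"Message: {text}")
        lines.append(f"Answer: {', '.join(labels) or 'none'}")
        lines.append("")
    lines.append(f"Message: {query}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    profile = ModerationProfile()
    profile.block("toxic", "insults or abusive language")
    profile.block("spam", "unsolicited promotion", example="WIN a free phone now!!!")
    print(build_prompt(profile, "Limited offer, click here"))
```

In this sketch all personalization lives in the prompt text, so the resulting string can be sent to any foundation model, and changing a user's preferences only changes the stored definitions and examples.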

Paper Structure

This paper contains 39 sections, 4 equations, 27 figures, 8 tables.

Figures (27)

  • Figure 1: Prompt template for in-context learning (ICL).
  • Figure 2: The task description of single-task ICL (Toxicity).
  • Figure 3: The performance of ICL on binary spam classification.
  • Figure 4: The performance of ICL on binary sentiment analysis.
  • Figure 5: The performance of ICL on binary toxicity classification.
  • ...and 22 more figures