Beyond One-Size-Fits-All: Personalized Harmful Content Detection with In-Context Learning

Rufan Zhang; Lin Zhang; Xianghang Mi

TL;DR

This work shows that in-context learning (ICL) with foundation models can unify the detection of toxicity, spam, and negative sentiment across binary, multi-class, and multi-label tasks without retraining. The models generalize well across tasks, performing strongly on public benchmarks and on a new in-the-wild Mastodon dataset, and they support lightweight personalization: users can block new categories, unblock existing ones, or extend detection to semantic variations with minimal supervision. The study also finds that rationale-augmented prompts improve robustness to noisy real-world data, and it emphasizes that evaluation on wild data is necessary to capture domain shifts. Overall, the approach offers a privacy-preserving, user-centric pathway for next-generation content safety systems, with code and data publicly released to support reproducibility.

Abstract

The proliferation of harmful online content (e.g., toxicity, spam, and negative sentiment) demands robust and adaptable moderation systems. However, prevailing moderation systems are centralized and task-specific, offering limited transparency and neglecting diverse user preferences, which makes them ill-suited for privacy-sensitive or decentralized environments. We propose a novel framework that leverages in-context learning (ICL) with foundation models to unify the detection of toxicity, spam, and negative sentiment across binary, multi-class, and multi-label settings. Crucially, our approach enables lightweight personalization, allowing users to block new categories, unblock existing ones, or extend detection to semantic variations through simple prompt-based interventions, all without model retraining. Extensive experiments on public benchmarks (TextDetox, UCI SMS, SST2) and a new, annotated Mastodon dataset reveal that: (i) foundation models achieve strong cross-task generalization, often matching or surpassing task-specific fine-tuned models; (ii) effective personalization is achievable with as few as one user-provided example or definition; and (iii) augmenting prompts with label definitions or rationales significantly enhances robustness to noisy, real-world data. Our work demonstrates a shift beyond one-size-fits-all moderation, establishing ICL as a practical, privacy-preserving, and highly adaptable pathway for the next generation of user-centric content safety systems. To foster reproducibility and facilitate future research, we publicly release our code on GitHub and the annotated Mastodon dataset on Hugging Face.
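
To make the idea of prompt-based personalization concrete, the following is a minimal sketch of how a per-user prompt could be assembled from label definitions and a handful of demonstrations. It is an illustration only, not the authors' released code: the class name `ModerationPrompt`, its methods, the prompt wording, and the example categories are all hypothetical, and the call to an actual foundation model is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class ModerationPrompt:
    """Hypothetical per-user prompt state; not taken from the paper's code."""
    # label -> short natural-language definition supplied by the user
    definitions: dict = field(default_factory=dict)
    # (message, labels) demonstration pairs shown to the model in context
    examples: list = field(default_factory=list)

    def block(self, label, definition, example=None):
        """Add a category using a definition and, optionally, one example."""
        self.definitions[label] = definition
        if example is not None:
            self.examples.append((example, [label]))

    def unblock(self, label):
        """Stop flagging a category; no model retraining is involved."""
        self.definitions.pop(label, None)
        # Assumed design choice: keep the old demonstrations but strip the
        # label, so they now act as examples of acceptable content.
        self.examples = [
            (text, [l for l in labels if l != label])
            for text, labels in self.examples
        ]

    def render(self, message):
        """Build the multi-label prompt text for one incoming message."""
        lines = [
            "Label the message with every category that applies, or 'none'.",
            "Categories:",
        ]
        for label, definition in self.definitions.items():
            lines.append(f"- {label}: {definition}")
        for text, labels in self.examples:
            lines.append(f"Message: {text}")
            lines.append(f"Labels: {', '.join(labels) if labels else 'none'}")
        lines.append(f"Message: {message}")
        lines.append("Labels:")
        return "\n".join(lines)


prompt = ModerationPrompt()
prompt.block("toxic", "insults or abusive language")
prompt.block("spam", "unsolicited promotion", example="WIN a free prize now!!!")
print(prompt.render("hello there"))
```

In this sketch, blocking or unblocking a category only edits the prompt state, which mirrors the abstract's claim that personalization needs no retraining. The rendered string would then be sent to whichever foundation model is in use.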
