Ideology-Based LLMs for Content Moderation
Stefano Civelli, Pietro Bernardelle, Nardiena A. Pratama, Gianluca Demartini
TL;DR
This work examines how persona-based conditioning steers LLMs in content moderation across architectures, sizes, and modalities. By mapping synthetic personas onto a two-dimensional political compass and selecting extreme positions, the authors study baseline moderation, ideological sensitivity, agreement patterns, and partisan bias using six open-source instruction-tuned LLMs on text and multimodal hate-speech datasets. They find that while overall accuracy remains stable, persona conditioning induces systematic ideological effects: different leanings alter the likelihood of labeling content as harmful, and larger models exhibit stronger intra-ideology cohesion and partisan biases, especially in politically charged tasks. The results highlight that persona prompting can inadvertently encode and amplify ideological biases, raising fairness concerns for AI-powered moderation and motivating further work on robustness and mitigation strategies. These insights matter for deploying AI in sensitive moderation contexts, because they reveal subtle biases that can pass as neutrality while reflecting in-group protection and out-group skepticism.
Abstract
Large language models (LLMs) are increasingly used in content moderation systems, where ensuring fairness and neutrality is essential. In this study, we examine how persona adoption influences the consistency and fairness of harmful content classification across different LLM architectures, model sizes, and content modalities (language vs. vision). At first glance, headline performance metrics suggest that personas have little impact on overall classification accuracy. However, a closer analysis reveals important behavioral shifts. Personas with different ideological leanings display distinct propensities to label content as harmful, showing that the lens through which a model "views" input can subtly shape its judgments. Further agreement analyses highlight that models, particularly larger ones, tend to align more closely with personas from the same political ideology, strengthening within-ideology consistency while widening divergence across ideological groups. To show this effect more directly, we conducted an additional study on a politically targeted task, which confirmed that personas not only behave more coherently within their own ideology but also exhibit a tendency to defend their perspective while downplaying harmfulness in opposing views. Together, these findings highlight how persona conditioning can introduce subtle ideological biases into LLM outputs, raising concerns about the use of AI systems that may reinforce partisan perspectives under the guise of neutrality.
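
The within-ideology versus cross-ideology agreement comparison described above can be illustrated with a minimal sketch. It assumes binary harmful/not-harmful labels per persona and uses Cohen's kappa as the agreement measure; the abstract does not state which statistic the authors use, so kappa, the function names, and the persona identifiers here are all illustrative assumptions rather than the paper's method.

```python
from itertools import combinations


def harmful_rate(preds):
    """Fraction of items a persona labels as harmful (1 = harmful, 0 = not)."""
    return sum(preds) / len(preds)


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of binary labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    # Agreement expected by chance from each rater's marginal label rates.
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        # Both raters are constant and identical; kappa is formally
        # undefined, and this sketch chooses to report perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def agreement_by_group(labels, ideology):
    """Mean pairwise kappa for same-ideology and cross-ideology persona pairs.

    labels:   dict persona -> list of 0/1 predictions on the same items
    ideology: dict persona -> ideology group name (hypothetical grouping)
    """
    within, across = [], []
    for p, q in combinations(sorted(labels), 2):
        kappa = cohen_kappa(labels[p], labels[q])
        (within if ideology[p] == ideology[q] else across).append(kappa)

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return {"within": mean(within), "across": mean(across)}
```

Under these assumptions, a gap between the `within` and `across` values that grows with model size would correspond to the pattern the abstract reports, while similar `harmful_rate` values across personas would be consistent with stable headline accuracy.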
