Clustering Document Parts: Detecting and Characterizing Influence Campaigns from Documents
Zhengxiang Wang, Owen Rambow
TL;DR
This work reframes influence campaign detection as a clustering problem over document parts rather than a binary document-level task, enabling fine-grained and interpretable characterizations of campaigns. It introduces a four-stage pipeline: (1) extracting document parts, including belief spans; (2) clustering the parts with SBERT embeddings, using KMeans and HDBSCAN with UMAP; (3) classifying high-influence clusters; and (4) projecting back to high-influence documents, with aggregation across many clustering runs. Key contributions include using spans derived from event factuality prediction as dense parts, demonstrating that parts-based clustering outperforms document-level approaches, and proposing a cluster-aggregation strategy that stabilizes performance without overfitting to lexical cues. Evaluations on a multi-media dataset derived from DARPA INCAS show systematic gains in precision and F1 over baselines and provide interpretable insight into which parts drive the influence-campaign classifications, with practical implications for scalable monitoring and analysis of influence operations.
Abstract
We propose a novel clustering pipeline to detect and characterize influence campaigns from documents. This approach clusters parts of documents, detects clusters that likely reflect an influence campaign, and then identifies documents linked to an influence campaign via their association with the high-influence clusters. Our approach outperforms both direct document-level classification and direct document-level clustering in predicting whether a document is part of an influence campaign. We propose various novel techniques to enhance our pipeline, including using an existing event factuality prediction system to obtain document parts, and aggregating multiple clustering experiments to improve the performance of both cluster and document classification. Classifying documents after clustering not only accurately extracts the parts of the documents that are relevant to influence campaigns, but also captures influence campaigns as a coordinated and holistic phenomenon. Our approach enables more fine-grained and interpretable characterizations of influence campaigns from documents.
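The final step, linking documents to high-influence clusters and aggregating over several clustering runs, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `project_to_documents`, the majority-vote rule, and the `min_vote_share` threshold are assumptions made for illustration, and the cluster assignments and high-influence cluster sets are taken as given (in the pipeline they would come from the clustering and cluster-classification stages).

```python
from collections import defaultdict


def project_to_documents(part_to_doc, runs, min_vote_share=0.5):
    """Flag documents whose parts fall in high-influence clusters in enough runs.

    part_to_doc: dict mapping part_id -> doc_id.
    runs: list of (part_to_cluster, high_clusters) pairs, one per clustering run,
          where part_to_cluster maps part_id -> cluster_id and high_clusters is
          the set of cluster ids classified as high-influence in that run.
    min_vote_share: hypothetical threshold on the share of runs that must flag
          a document (an assumption, not a value from the paper).
    """
    votes = defaultdict(int)
    for part_to_cluster, high_clusters in runs:
        # A document is flagged in this run if any of its parts sits in a
        # high-influence cluster; each run contributes at most one vote.
        flagged = {
            part_to_doc[part_id]
            for part_id, cluster_id in part_to_cluster.items()
            if cluster_id in high_clusters
        }
        for doc_id in flagged:
            votes[doc_id] += 1

    n_runs = len(runs)
    if n_runs == 0:
        return set()
    return {
        doc_id
        for doc_id in set(part_to_doc.values())
        if votes[doc_id] / n_runs >= min_vote_share
    }


# Toy data: four parts from three documents, three clustering runs.
part_to_doc = {"p1": "docA", "p2": "docA", "p3": "docB", "p4": "docC"}
runs = [
    ({"p1": 0, "p2": 1, "p3": 1, "p4": 2}, {0}),
    ({"p1": 0, "p2": 0, "p3": 1, "p4": 1}, {0}),
    ({"p1": 2, "p2": 1, "p3": 1, "p4": 0}, {1}),
]
high_influence_docs = project_to_documents(part_to_doc, runs)
```

In this toy data, docA is flagged in all three runs and docB in only one, so only docA passes the default threshold. Voting across runs is one simple way to reduce dependence on any single clustering result; other aggregation rules are equally plausible.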
