AIMM: An AI-Driven Multimodal Framework for Detecting Social-Media-Influenced Stock Market Manipulation

Sandeep Neela

TL;DR

AIMM addresses the rise of social-media-driven market manipulation by fusing Reddit-derived signals with OHLCV market data into a single daily AIMM Manipulation Risk Score (AMRS) per ticker. The framework extends the Stock-Pattern-Assistant by incorporating social volume, sentiment, bot-likeness, coordination, and market anomalies, and it uses a parquet-based pipeline plus a Streamlit dashboard for exploratory analysis. A key contribution is the AIMM-GT ground-truth dataset, along with forward-walk evaluation and prospective prediction logging to emulate real-time deployment. Early results on a small dataset (33 labeled ticker-days, 3 positive events) show strong ranking discrimination (ROC-AUC of about 0.99) and the potential for multi-signal early warnings: AIMM flagged GME 22 days before the January 2021 squeeze peak. Limitations include the small sample size, reliance on synthetic social features due to Reddit data restrictions, and the need for broader validation before production deployment.

Abstract

Market manipulation now routinely originates from coordinated social media campaigns, not isolated trades. Retail investors, regulators, and brokerages need tools that connect online narratives and coordination patterns to market behavior. We present AIMM, an AI-driven framework that fuses Reddit activity, bot and coordination indicators, and OHLCV market features into a daily AIMM Manipulation Risk Score for each ticker. The system uses a parquet-native pipeline with a Streamlit dashboard that allows analysts to explore suspicious windows, inspect underlying posts and price action, and log model outputs over time. Due to Reddit API restrictions, we employ calibrated synthetic social features matching documented event characteristics; market data (OHLCV) uses real historical data from Yahoo Finance. This release makes three contributions. First, we build the AIMM Ground Truth dataset (AIMM-GT): 33 labeled ticker-days spanning eight equities, drawing from SEC enforcement actions, community-verified manipulation cases, and matched normal controls. Second, we implement forward-walk evaluation and prospective prediction logging for both retrospective and deployment-style assessment. Third, we analyze lead times and show that AIMM flagged GME 22 days before the January 2021 squeeze peak. The current labeled set is small (33 ticker-days, 3 positive events), but results show preliminary discriminative capability and early warnings for the GME incident. We release the code, dataset schema, and dashboard design to support research on social media-driven market surveillance.
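
The forward-walk idea mentioned above can be illustrated with a minimal Python sketch. It is not the AIMM implementation: the function names (`forward_walk`, `recall_at`, `pairwise_auc`), the toy feature values, the toy labels, and the trailing-mean scoring rule are all assumptions made for illustration, and "prior to the labeled date" is read here as strictly earlier dates. The sketch scores each labeled day from earlier data only, then reports recall at a fixed threshold alongside a pairwise ROC-AUC, which shows how a conservative threshold can miss positives even when the ranking is perfect.

```python
from datetime import date

def forward_walk(features_by_day, labeled_days, score_fn):
    """Score each labeled day using only rows dated strictly before it."""
    scored = []
    for day, label in labeled_days:
        history = [x for d, x in sorted(features_by_day.items()) if d < day]
        scored.append((day, label, score_fn(history)))
    return scored

def recall_at(scored, threshold):
    """Fraction of positive days whose score reaches the threshold."""
    positives = [s for _, y, s in scored if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)

def pairwise_auc(scored):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for _, y, s in scored if y == 1]
    neg = [s for _, y, s in scored if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def trailing_mean(history, window=3):
    """Hypothetical scoring rule: mean of the last `window` values seen so far."""
    recent = history[-window:]
    return sum(recent) / len(recent) if recent else 0.0

# Toy daily anomaly values and labels (illustrative only, not AIMM-GT data).
values = [0.10, 0.12, 0.11, 0.15, 0.40, 0.45, 0.50, 0.70, 0.90, 0.95]
features_by_day = {date(2021, 1, i + 1): v for i, v in enumerate(values)}
labeled_days = [
    (date(2021, 1, 4), 0),
    (date(2021, 1, 5), 0),
    (date(2021, 1, 8), 1),
    (date(2021, 1, 10), 1),
]

scored = forward_walk(features_by_day, labeled_days, trailing_mean)
print(recall_at(scored, 0.5))  # 0.5: one of the two toy positives is missed
print(pairwise_auc(scored))    # 1.0: every positive still outranks every negative
```

In this toy run no negative crosses the 0.5 threshold, one positive scores just under it, and the ranking is still perfect, so threshold-based and ranking-based metrics diverge.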

Paper Structure

This paper contains 134 sections, 14 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: High-level AIMM system architecture, from ingestion to Streamlit deployment.
  • Figure 2: Overview of the AIMM-GT v2.0 ground-truth dataset used for evaluation. The dataset contains 33 labeled ticker-days across eight equities, consisting of three manipulation events and thirty matched normal controls. Metadata columns capture the manipulation type, confidence level, and source (SEC, community-verified, or synthetic negative examples). This dataset forms the basis for forward-walk and prospective validation experiments.
  • Figure 3: Forward-walk evaluation of AIMM using only data available prior to each labeled date. At the default threshold of 0.5, AIMM assigns low risk to all negative instances but misses several positive events, yielding conservative pointwise metrics. ROC-AUC (0.99) and PR-AUC (0.83) suggest promising ranking performance on this limited dataset, though confidence intervals are wide given the small sample size (n=3 positive events). These metrics should be interpreted cautiously as proof-of-concept rather than robust performance estimates. This evaluation avoids look-ahead bias and reflects real-time operational behavior.
  • Figure 4: Prospective evaluation results based on AIMM's live prediction log. Each prediction is timestamped and later matched with ground truth once labels become available. At the default threshold, AIMM achieves perfect precision, recall, and F1 on the subset of predictions with known labels. Because the sample size remains small, this result is best read as a consistency check: it indicates that AIMM's streaming pipeline (data ingestion, scoring, and logging) behaves in line with its offline evaluation.
  • Figure 5: Early-detection analysis for the January 28, 2021 GME incident. AIMM produces risk scores over a 45-day window centered on the event. Multiple pre-event alerts exceed the 0.55 threshold (e.g., January 6, 14, 19, and 25), with the earliest warning occurring approximately 22 days before the labeled manipulation date. This case study illustrates AIMM's ability to detect emerging social-market distortions prior to peak activity.
  • ...and 6 more figures
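
The lead-time arithmetic in the Figure 5 caption can be checked with a short Python sketch. The alert dates and the event date are taken from the caption; the function name `lead_days` is hypothetical, and counting calendar days rather than trading days is an assumption. The caption lists these alerts as examples, so the list may not be exhaustive.

```python
from datetime import date

# Dates as given in the Figure 5 caption.
EVENT_DATE = date(2021, 1, 28)
ALERT_DATES = [
    date(2021, 1, 6),
    date(2021, 1, 14),
    date(2021, 1, 19),
    date(2021, 1, 25),
]

def lead_days(alert_dates, event_date):
    """Calendar days from each pre-event alert to the event, earliest alert first."""
    return [(event_date - d).days for d in sorted(alert_dates) if d < event_date]

leads = lead_days(ALERT_DATES, EVENT_DATE)
print(leads)       # [22, 14, 9, 3]
print(max(leads))  # 22, the earliest warning
```

The January 6 alert gives the 22-day lead cited in the caption; measured in trading days the figure would be smaller.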