POLAR: A Per-User Association Test in Embedding Space

Pedro Bento, Arthur Buzelin, Arthur Chagas, Yan Aquino, Victoria Estanislau, Samira Malaquias, Pedro Robles Dutenhefner, Gisele L. Pappa, Virgilio Almeida, Wagner Meira Jr.

Abstract

Most intrinsic association probes operate at the word, sentence, or corpus level, obscuring author-level variation. We present POLAR (Per-user On-axis Lexical Association Report), a per-user lexical association test that runs in the embedding space of a lightly adapted masked language model. Authors are represented by private deterministic tokens; POLAR projects these vectors onto curated lexical axes and reports standardized effects with permutation p-values and Benjamini-Hochberg control. On a balanced bot-human Twitter benchmark, POLAR cleanly separates LLM-driven bots from organic accounts; on an extremist forum, it quantifies strong alignment with slur lexicons and reveals rightward drift over time. The method is modular to new attribute sets and provides concise, per-author diagnostics for computational social science. All code is publicly available at https://github.com/pedroaugtb/POLAR-A-Per-User-Association-Test-in-Embedding-Space.
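The pipeline summarized in the abstract (project a user vector onto a lexical axis, standardize the effect, attach a permutation p-value, then apply Benjamini-Hochberg control) can be illustrated with a minimal sketch. The exact POLAR statistic is not given in this excerpt, so the code below assumes a standard WEAT-style effect size (difference of mean cosine similarities divided by the pooled standard deviation) and a two-sided label-shuffling null. The function names `user_association` and `benjamini_hochberg`, and all default parameters, are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def cosine(u, M):
    """Cosine similarity between vector u and each row of matrix M."""
    u = u / np.linalg.norm(u)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M @ u

def effect_size(sims, n_a):
    """Assumed WEAT-style effect: the first n_a similarities belong to set A."""
    return (sims[:n_a].mean() - sims[n_a:].mean()) / sims.std(ddof=1)

def user_association(u, A, B, n_perm=2000, seed=0):
    """Return (standardized effect, two-sided permutation p-value) for one user.

    u: user vector, A and B: attribute-word embeddings (rows are words).
    The null shuffles which similarities are labeled A versus B.
    """
    rng = np.random.default_rng(seed)
    sims = np.concatenate([cosine(u, A), cosine(u, B)])
    n_a = len(A)
    observed = effect_size(sims, n_a)
    hits = 0
    for _ in range(n_perm):
        permuted = rng.permutation(sims)
        if abs(effect_size(permuted, n_a)) >= abs(observed):
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (1 + hits) / (1 + n_perm)

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected under the BH step-up procedure."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        reject[order[:k + 1]] = True
    return reject
```

In use, one would call `user_association` once per user and axis, collect the p-values, and pass them to `benjamini_hochberg` to decide which per-user associations to report.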

Paper Structure (50 sections, 6 equations, 5 figures, 3 tables)

Introduction
Related Work
User Representations in NLP and CSS
User Profiling and Bot Detection with Embeddings
Compact Conditioning for Personalization
Intrinsic Bias and Association Tests
Bots, Toxicity, and Political Discourse
Methodology
Attribute-Set Construction and Documentation
Data Construction and User Tokens
Training sketch
POLAR in the Learned Space
Statistic.
Null, $p$-value, and exchangeability.
Multiplicity Control.
...and 35 more sections

Figures (5)

Figure 1: Overview of POLAR, which first learns user vectors via masked language modeling with injected tokens (Phase 1), then computes association scores through cosine similarity and permutation testing against lexical attribute sets (Phase 2).
Figure 2: PCA projection of users in the 4D POLAR-axis score space. Marker shape encodes the gold label (bot vs. human), while color indicates the dominant axis (largest $|s|$ for that user). Centroids (black $\times$) summarize average positions; PCA is used for visualization, while classification is performed in the original axis space (Tables \ref{tab:axis-bots}--\ref{tab:fox8-oof}).
Figure 3: Box-and-swarm plots of per-user POLAR scores $s(u;\mathcal{A},\mathcal{B})$ (y-axis; Eq. \ref{eq:userweat}) for humans and bots across four lexical axes. Points are individual users; boxplots summarize median and interquartile range.
Figure 4: Stormfront per-user POLAR associations on sensitive targets and policy frames.
Figure 5: Stormfront dynamics: per-user scatterplots mark least- (blue) and most-aligned (red) accounts on four slur axes; coloured trajectories track selected users with the fastest drift toward alignment.
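
The Figure 2 caption describes two operations on the per-user axis scores: picking each user's dominant axis (largest $|s|$) and projecting the 4D score vectors to 2D with PCA for plotting. A minimal sketch of both, assuming the scores are stored as a NumPy array of shape (n_users, n_axes); the helper names `dominant_axis` and `pca_2d` are hypothetical, and PCA is computed here via SVD rather than any particular library the authors may have used.

```python
import numpy as np

def dominant_axis(scores):
    """Index of the axis with the largest absolute score for each user."""
    return np.argmax(np.abs(scores), axis=1)

def pca_2d(scores):
    """Project users onto the first two principal components (visualization only)."""
    centered = scores - scores.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

As the caption notes, a projection like this serves only for visualization; any classifier would operate on the original axis scores.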
