HebID: Detecting Social Identities in Hebrew-language Political Text
Guy Mor-Lan, Naama Rivlin-Angert, Yael R. Kaplan, Tamir Sheafer, Shaul R. Shenhav
TL;DR
HebID addresses the shortage of non-English resources for fine-grained social identity detection in political text. It introduces a survey-grounded, multilabel Hebrew corpus of 5,536 sentences from Israeli politicians' Facebook posts, annotated for twelve social identities. It benchmarks encoder models and 2B-9B-parameter generative LLMs; Hebrew-tuned decoder LLMs (notably DictaLM2.0) perform best, reaching a macro-F1 of 0.743 and generalizing across genres to Knesset speeches (0.72). The authors link Facebook discourse, parliamentary speeches, and public survey data to examine identity prevalence, election-cycle surges, and gender patterns, and they assess external validity against CHES-Israel policy rankings, reporting consistently significant correlations. HebID thus provides a foundational resource and a methodological blueprint for studying identity discourse in Hebrew and other non-English political contexts, with applications to sociolinguistic analysis and NLP benchmarking.
Abstract
Political language is deeply intertwined with social identities. While social identities are often shaped by specific cultural contexts and expressed through particular uses of language, existing datasets for group and identity detection are predominantly English-centric, single-label, and focused on coarse identity categories. We introduce HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians' Facebook posts (Dec 2018-Apr 2021), manually annotated for twelve nuanced social identities (e.g., Rightist, Ultra-Orthodox, Socially-oriented) grounded in survey data. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-$F_1$ = 0.74). We apply our classifier to politicians' Facebook posts and parliamentary speeches, evaluating differences in popularity, temporal trends, clustering patterns, and gender-related variations in identity expression. We also draw on identity choices from a national public survey, enabling a comparison between the identities portrayed in elite discourse and the public's identity priorities. HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts.
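To make the reported metric concrete, the following is a minimal sketch of how macro-$F_1$ is conventionally computed for multilabel predictions: $F_1$ is calculated separately for each identity label and then averaged without weighting, so rare identities count as much as frequent ones. The function name `macro_f1`, the three-label subset, and the toy annotations are illustrative assumptions, not the authors' evaluation code or data.

```python
from typing import List, Sequence

# Illustrative subset of identity labels named in the abstract; the toy
# annotations below are invented for demonstration only.
LABELS = ["Rightist", "Ultra-Orthodox", "Socially-oriented"]


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1 over binary indicator rows.

    Each row is one sentence; each column is one label (1 = label present).
    A label with no true and no predicted positives contributes 0.0.
    """
    n_labels = len(y_true[0])
    scores: List[float] = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic mean of P and R.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Four toy sentences; a sentence may carry several labels at once.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

score = macro_f1(y_true, y_pred)
print(f"macro-F1 over {len(LABELS)} labels: {score:.3f}")
```

In this toy case the per-label scores are 1, 2/3, and 2/3, so the macro average is 7/9. Library implementations (for example scikit-learn's `f1_score` with `average="macro"`) follow the same per-label averaging, though their handling of labels with no positives may differ.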
