From Chaos to Clarity: Claim Normalization to Empower Fact-Checking
Megha Sundriyal, Tanmoy Chakraborty, Preslav Nakov
TL;DR
This work defines Claim Normalization (ClaimNorm), the task of distilling a social media post's central, verifiable assertion, to address the gap between noisy content and fact-checking needs. It introduces CACN, a chain-of-thought and check-worthiness aware framework that leverages in-context learning with large language models to generate concise normalized claims, and CLAN, a real-world dataset of 6,388 post–normalized-claim pairs sourced from the Google Fact-Check Explorer and the ClaimReview Schema. Experiments show that CACN outperforms strong baselines across lexical and semantic metrics, with prompt tuning and in-context learning delivering substantial gains, while zero-shot results indicate that the underlying language models already have notable capability on the task. The work discusses limitations, data biases, and environmental considerations, and outlines future directions, including multilingual and multimodal extensions, to broaden its impact in automated fact-checking pipelines.
Abstract
With the rise of social media, users are exposed to many misleading claims. However, the pervasive noise inherent in these posts makes it challenging to identify the precise and prominent claims that require verification. Extracting the important claims from such posts is arduous and time-consuming, yet it remains an underexplored problem. Here, we aim to bridge this gap. We introduce a novel task, Claim Normalization (aka ClaimNorm), which aims to decompose complex and noisy social media posts into more straightforward and understandable forms, termed normalized claims. We propose CACN, a pioneering approach that leverages chain-of-thought and claim check-worthiness estimation, mimicking human reasoning processes, to comprehend intricate claims. Moreover, we capitalize on the in-context learning capabilities of large language models to provide guidance and to improve claim normalization. To evaluate the effectiveness of our proposed model, we meticulously compile a comprehensive real-world dataset, CLAN, comprising more than 6k instances of social media posts alongside their respective normalized claims. Our experiments demonstrate that CACN outperforms several baselines across various evaluation measures. Finally, our rigorous error analysis highlights CACN's capabilities and pitfalls.
