Reasoning-Guided Claim Normalization for Noisy Multilingual Social Media Posts
Manan Sharma, Arya Suneesh, Manish Jain, Pawan Kumar Rajpoot, Prasanna Devadiga, Bharatdeep Hazarika, Ashish Shrivastava, Kishan Gurumurthy, Anshuman B Suresh, Aditya U Baliga
TL;DR
The paper tackles multilingual claim normalization for misinformation across 20 languages by fine-tuning a multilingual generator (Qwen3-14B) with LoRA and 4-bit quantization, guided by a structured 5W1H reasoning framework and retrieval-augmented few-shot prompting. It cleans the data via intra-post deduplication and token-level recall filtering, and augments training with explicit What/Who/Where/When/How/Why reasoning. The English-trained model is evaluated cross-lingually on 13 languages and zero-shot on 7 more, with METEOR as the primary metric. Cross-lingual transfer is strong, especially for Romance languages, and the reasoning and retrieval components yield substantial gains. The work demonstrates a scalable approach that maintains semantic coherence across languages and supports integration into fact-checking pipelines. Future work targets broader language coverage and evaluation that generalizes across platforms.
Abstract
We address claim normalization for multilingual misinformation detection: transforming noisy social media posts into clear, verifiable statements across 20 languages. Our key contribution is to show that systematically decomposing posts using Who, What, Where, When, Why and How questions enables robust cross-lingual transfer despite training exclusively on English data. Our methodology combines three components: fine-tuning Qwen3-14B with LoRA on the provided dataset after intra-post deduplication, token-level recall filtering for semantic alignment, and retrieval-augmented few-shot learning with contextual examples during inference. Our system achieves METEOR scores ranging from 41.16 (English) to 15.21 (Marathi), securing third rank on the English leaderboard and fourth rank for Dutch and Punjabi. The approach shows a 41.3% relative improvement in METEOR over baseline configurations and substantial gains over existing methods. Results demonstrate effective cross-lingual generalization for Romance and Germanic languages while maintaining semantic coherence across diverse linguistic structures.
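
The two data-cleaning steps named above can be illustrated with a minimal sketch. The abstract does not specify how either step is implemented, so everything here is an assumption: the function names (`dedupe_post`, `token_recall`, `filter_pairs`), the sentence-level segmentation, the definition of recall as the fraction of claim tokens that also occur in the post, and the `min_recall=0.5` threshold are all hypothetical choices made for illustration.

```python
import re


def dedupe_post(post: str) -> str:
    """Drop repeated segments inside one post, keeping first occurrences in order.

    Assumption: segments are split on newlines and sentence-ending punctuation.
    """
    segments = [s.strip() for s in re.split(r"[\n.!?]+", post) if s.strip()]
    seen = set()
    kept = []
    for seg in segments:
        key = " ".join(seg.lower().split())  # case- and whitespace-insensitive key
        if key not in seen:
            seen.add(key)
            kept.append(seg)
    return ". ".join(kept)


def token_recall(post: str, claim: str) -> float:
    """Fraction of claim tokens that also appear in the post (assumed definition)."""
    post_tokens = set(re.findall(r"\w+", post.lower()))
    claim_tokens = re.findall(r"\w+", claim.lower())
    if not claim_tokens:
        return 0.0
    hits = sum(1 for tok in claim_tokens if tok in post_tokens)
    return hits / len(claim_tokens)


def filter_pairs(pairs, min_recall=0.5):
    """Deduplicate each post, then keep pairs whose claim is well covered by it.

    `min_recall` is a hypothetical threshold, not a value from the paper.
    """
    cleaned = []
    for post, claim in pairs:
        post = dedupe_post(post)
        if token_recall(post, claim) >= min_recall:
            cleaned.append((post, claim))
    return cleaned
```

Applied to a list of `(post, normalized_claim)` pairs, `filter_pairs` would discard training examples whose reference claim shares few tokens with the post, which is one plausible reading of filtering "for semantic alignment".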
