Adaptive Originality Filtering: Rejection Based Prompting and RiddleScore for Culturally Grounded Multilingual Riddle Generation
Duy Le, Kent Ziti, Evan Girard-Sun, Bakr Bouhaya, Sean O'Brien, Vasu Sharma, Kevin Zhu
TL;DR
This work tackles multilingual riddle generation by introducing Adaptive Originality Filtering (AOF), a prompting framework that enforces semantic novelty and cultural fidelity through a cosine-similarity rejection loop, coupled with a composite RiddleScore for evaluation. RiddleScore blends Novelty, Diversity, Fluency, and Semantic Alignment using lightweight back-end models and is calibrated to align with human judgments across languages. Empirically, AOF improves diversity and reduces repetition (Self-BLEU) while elevating creativity and cultural grounding across English, Chinese, Arabic, Japanese, and French when applied to GPT-4o, LLaMA 3.1, and DeepSeek R1, with notable gains when the GPT-4o model is fine-tuned on BiRdQA data. The work demonstrates that semantic-filtering prompts can meaningfully enhance culturally grounded, cross-lingual creativity without requiring full model fine-tuning, and offers a pathway to applying these techniques to broader creative tasks. The combination of AOF and RiddleScore provides a practical, scalable framework for evaluating and improving multilingual, figurative text generation in real-world applications.
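The cosine-similarity rejection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function stands in for a real sentence encoder, the names `filter_original` and `threshold` and the value 0.75 are hypothetical, and candidates are filtered from a fixed list rather than regenerated by a language model after each rejection.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_original(candidates, references, threshold=0.75):
    """Keep candidates whose maximum similarity to any reference or
    previously accepted candidate is below the (hypothetical) threshold."""
    accepted = []
    for cand in candidates:
        vec = embed(cand)
        pool = references + accepted
        max_sim = max((cosine(vec, embed(r)) for r in pool), default=0.0)
        if max_sim < threshold:
            accepted.append(cand)
        # A full rejection loop would re-prompt the model here instead of skipping.
    return accepted


references = ["what has keys but cannot open locks"]
candidates = [
    "what has keys but cannot open doors",            # near-copy of the reference
    "i speak without a mouth and hear without ears",  # novel
    "i speak without a mouth and hear without ears",  # exact repeat
]
kept = filter_original(candidates, references)
```

In this toy run only the second candidate survives: the first is too close to the reference, and the third duplicates an already accepted riddle.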
Abstract
Language models are increasingly tested on multilingual creativity, which demands culturally grounded, abstract generations. Standard prompting methods often produce repetitive or shallow outputs. We introduce Adaptive Originality Filtering (AOF), a prompting strategy that enforces novelty and cultural fidelity via semantic rejection. To assess quality, we propose RiddleScore, a metric combining novelty, diversity, fluency, and answer alignment. AOF raises Distinct-2 (to 0.915 in Japanese), lowers Self-BLEU (to 0.177), and raises RiddleScore (by up to 57.1% in Arabic). Human evaluations confirm gains in fluency, creativity, and cultural fit. However, improvements vary by language: in Arabic, RiddleScore gains exceed Distinct-2 gains, whereas in Japanese the two metrics change by similar amounts. Though focused on riddles, our method may apply to broader creative tasks. Overall, semantic filtering with composite evaluation offers a lightweight path to culturally rich generation without fine-tuning.
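To make the diversity metric concrete, the sketch below computes Distinct-2 under its usual definition (unique bigrams divided by total bigrams) and shows one way four component scores could be blended into a single number. Both functions rest on assumptions: whitespace tokenization is a simplification that would not suit Chinese or Japanese text, and `composite_score` with equal weights is a hypothetical stand-in, since the abstract does not state how RiddleScore weights its components.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a set of texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


def composite_score(novelty, diversity, fluency, alignment,
                    weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted mean of four component scores in [0, 1].
    The equal weights are hypothetical, not taken from the paper."""
    parts = (novelty, diversity, fluency, alignment)
    return sum(w * p for w, p in zip(weights, parts))


d2 = distinct_n(["a b c", "a b d"])        # 3 unique bigrams out of 4
score = composite_score(1.0, 0.5, 0.5, 0.0)
```

A higher Distinct-2 indicates less repeated phrasing across generations; a score near 1.0 means almost every bigram in the output set is unique.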
