Adaptive Originality Filtering: Rejection Based Prompting and RiddleScore for Culturally Grounded Multilingual Riddle Generation

Duy Le, Kent Ziti, Evan Girard-Sun, Bakr Bouhaya, Sean O'Brien, Vasu Sharma, Kevin Zhu

TL;DR

This work tackles multilingual riddle generation by introducing Adaptive Originality Filtering (AOF), a prompting framework that enforces semantic novelty and cultural fidelity through a cosine-similarity rejection loop, coupled with a composite RiddleScore for evaluation. RiddleScore blends Novelty, Diversity, Fluency, and Semantic Alignment using lightweight back-end models and is calibrated to align with human judgments across languages. Empirically, AOF improves diversity and reduces repetition (Self-BLEU) while elevating creativity and cultural grounding across English, Chinese, Arabic, Japanese, and French when applied to GPT-4o, LLaMA 3.1, and DeepSeek R1, with notable gains when the GPT-4o model is fine-tuned on BiRdQA data. The work demonstrates that semantic-filtering prompts can meaningfully enhance culturally grounded, cross-lingual creativity without requiring full model fine-tuning, and offers a pathway to applying these techniques to broader creative tasks. The combination of AOF and RiddleScore provides a practical, scalable framework for evaluating and improving multilingual, figurative text generation in real-world applications.

Abstract

Language models are increasingly tested on multilingual creativity, demanding culturally grounded, abstract generations. Standard prompting methods often produce repetitive or shallow outputs. We introduce Adaptive Originality Filtering (AOF), a prompting strategy that enforces novelty and cultural fidelity via semantic rejection. To assess quality, we propose RiddleScore, a metric combining novelty, diversity, fluency, and answer alignment. AOF improves Distinct-2 (0.915 in Japanese), reduces Self-BLEU (0.177), and raises RiddleScore (up to +57.1% in Arabic). Human evaluations confirm gains in fluency, creativity, and cultural fit. However, improvements vary by language: Arabic shows larger gains in RiddleScore than in Distinct-2, while Japanese sees similar changes in both. Though focused on riddles, our method may apply to broader creative tasks. Overall, semantic filtering with composite evaluation offers a lightweight path to culturally rich generation without fine-tuning.
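
As a rough illustration of how a composite metric of this kind could be computed, the sketch below combines four component scores with the weights reported in the Figure 3 caption (0.30, 0.20, 0.30, 0.20). Several points are assumptions rather than details confirmed by this page: that the combination is a plain weighted sum, that each component is already normalized to [0, 1], and that the weights pair with novelty, diversity, fluency, and alignment in that order. The names `RiddleScoreWeights` and `riddle_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiddleScoreWeights:
    # Weights from the Figure 3 caption; the pairing with components is assumed.
    alpha: float = 0.30  # novelty
    beta: float = 0.20   # diversity
    gamma: float = 0.30  # fluency
    delta: float = 0.20  # semantic (answer) alignment


def riddle_score(novelty: float, diversity: float, fluency: float,
                 alignment: float,
                 w: RiddleScoreWeights = RiddleScoreWeights()) -> float:
    """Weighted sum of four component scores (assumed to lie in [0, 1])."""
    components = {
        "novelty": novelty,
        "diversity": diversity,
        "fluency": fluency,
        "alignment": alignment,
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (w.alpha * novelty + w.beta * diversity
            + w.gamma * fluency + w.delta * alignment)


# Hypothetical component values for a single generated riddle.
example = riddle_score(novelty=0.8, diversity=0.9, fluency=0.7, alignment=0.6)
```

Because the four weights sum to 1, the score stays in [0, 1] whenever the components do, which keeps scores comparable across languages under these assumptions.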

Paper Structure

This paper contains 128 sections, 2 equations, 9 figures, 32 tables, 1 algorithm.

Figures (9)

  • Figure 1: End-to-end pipeline to produce and verify riddles with LLMs (GPT-4o, R1, LLaMA). Constraints enforce novelty/structure; MiniLM tests semantic similarity with threshold $\leq 0.75$. Failed results are re-generated; accepted ones are subjected to final checking.
  • Figure 2: AOF rejection-sampling loop. Each candidate is generated, compared to reference riddles, and either accepted, rejected, or retried up to $k$ attempts.
  • Figure 3: RiddleScore components and weights ($\alpha{=}0.30,\beta{=}0.20,\gamma{=}0.30,\delta{=}0.20$).
  • Figure 4: Correlation between fine-tuning gains in RiddleScore and human evaluation scores across five languages. Each point represents one language; higher values correspond to more improvement compared to the pretrained model.
  • Figure 5: Percentage changes in RiddleScore, Self-BLEU, Distinct-2, and human evaluation after fine-tuning. Positive bars show improvements; negative Self-BLEU values (in red) indicate desirable reductions in repetition.
  • ...and 4 more figures
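
The rejection loop described in the captions for Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy bag-of-words `embed` function stands in for the MiniLM encoder named in the source so that the example runs without external dependencies, the `generate` callable stands in for an LLM call, and the reading that a candidate passes when its maximum cosine similarity to every reference is at most 0.75 is an assumption based on the Figure 1 caption. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Optional, Sequence


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a MiniLM sentence embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def aof_generate(generate: Callable[[int], str],
                 references: Sequence[str],
                 threshold: float = 0.75,
                 k: int = 5) -> Optional[str]:
    """Return the first candidate that is novel enough, or None after k tries.

    `generate(attempt)` is a hypothetical callable wrapping the LLM prompt.
    """
    ref_vecs = [embed(r) for r in references]
    for attempt in range(k):
        candidate = generate(attempt)
        vec = embed(candidate)
        max_sim = max((cosine(vec, r) for r in ref_vecs), default=0.0)
        if max_sim <= threshold:
            return candidate  # accepted: sufficiently different from references
        # otherwise rejected: fall through and retry with the next attempt
    return None


# Usage with canned drafts in place of a real model.
references = ["what has keys but cannot open locks"]
drafts = [
    "what has keys but cannot open doors",      # near-duplicate of the reference
    "i drink the sky and feed the rice fields",  # shares no tokens with it
]
result = aof_generate(lambda i: drafts[i % len(drafts)], references, k=3)
```

In this toy run the first draft overlaps heavily with the reference and is rejected, while the second draft shares no tokens and is accepted. A real setup would swap `embed` for a sentence encoder and would likely feed the rejection back into the next prompt, which this sketch does not attempt.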